Intro to OLS Regression
POLS 3316: Statistics for Political Scientists

Tom Hanna

2023-11-14

Linear Regression - Ordinary Least Squares Regression

\(y = \alpha + \beta X + \epsilon\)

  • What is the purpose of regression?
  • What is required to do OLS regression?
  • How does OLS work?

What is the purpose of regression?

\(y = \alpha + \beta X + \epsilon\)

  • To mathematically establish the relationship between our X and Y variables.


What is the purpose of regression?

\(y = \alpha + \beta X + \epsilon\)

  • To mathematically establish the relationship between our X and Y variables.
  • To draw a line.

What is the purpose of regression?

\(y = \alpha + \beta X + \epsilon\)

  • To mathematically establish the relationship between our X and Y variables.
  • To draw a line.
  • To reveal our expectation of Y given X.

What is the purpose of regression?

  • To draw the line that most closely resembles the real relationship between the X and Y variables

The regression model

\(y = \alpha + \beta X + \epsilon\) is our abstract model

Technically, regression gives us:

\(E[y] = \alpha + \beta X\)

where E[y] is our expectation of y given X. (The error term drops out of the expectation because its mean is zero.)

E[y] may also be called \(\hat{y}\).

The regression model

We want to minimize the distance between the actual data and the predicted, \(\hat{y}\), values for each observation.

How does OLS work?

\(y = \alpha + \beta X + \epsilon\)

Ordinary Least Squares Regression

  • This is another case of squared differences:

      + We used squared differences from the mean to get the variance
      + We used squared differences in the \(\chi^2\) test
  • The differences in this case are the distance between the actual data points and the predicted location of Y based on X.
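
As a concrete illustration (a minimal sketch with made-up numbers, not course data), here is that "squared differences" quantity for one candidate line:

```python
import numpy as np

# Made-up toy data for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

# A candidate line y_hat = a + b*x; a and b are chosen arbitrarily here
a, b = 1.0, 1.0
y_hat = a + b * x

residuals = y - y_hat            # vertical distance from each point to the line
ssr = np.sum(residuals ** 2)     # the sum of squared differences OLS minimizes
print(ssr)
```

OLS picks the values of a and b that make this sum as small as possible.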

A graphical (video) look

Minimizing the distance

The Least Squares Regression Line | A Demo Video from Statistics: An Animated Journey

We’ll watch up to about 3:00 in class, but you can watch the whole thing for more detail.

Ordinary Least Squares Regression

  • OLS is the method that minimizes the sum of the squared vertical distances from the data points to the line - the method of least squares

\(y = \alpha + \beta X + \epsilon\)

  • So how do we get there?
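
To make "minimizes the sum of the squared distances" concrete before we turn to the requirements, here is a small sketch (simulated data with arbitrary true values and seed) comparing a brute-force search over candidate lines with numpy's built-in least squares fit:

```python
import numpy as np

rng = np.random.default_rng(3316)                 # arbitrary seed
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)     # true intercept 2.0, slope 0.5

def ssr(a, b):
    """Sum of squared residuals for the candidate line y_hat = a + b*x."""
    return np.sum((y - (a + b * x)) ** 2)

# Brute force: try a grid of intercepts and slopes, keep the smallest SSR
a_grid = np.linspace(0, 4, 201)
b_grid = np.linspace(0, 1, 201)
best = min((ssr(a, b), a, b) for a in a_grid for b in b_grid)
print("grid search:", best[1], best[2])

# numpy's least squares fit (returns highest-degree coefficient first)
slope, intercept = np.polyfit(x, y, 1)
print("np.polyfit:", intercept, slope)
```

Both land on essentially the same line; the closed-form formulas later in these slides get there without any searching.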

What is required to do OLS?

\(y = \alpha + \beta X + \epsilon\)

The Four (five, seven?) assumptions of linear regression

  1. Linearity
  2. Normality
  3. Independence
  4. Homoskedasticity

Two of these are arguably consequences of the others, and the last doesn’t apply with only one X variable:

  5. Mean error is zero
  6. Error term observations are independent
  7. No perfect multicollinearity
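
Taken together, the assumptions describe a data-generating process like the one in this minimal sketch (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)      # arbitrary seed
n = 200
alpha, beta = 1.0, 0.75              # hypothetical true intercept and slope

x = rng.uniform(0, 10, size=n)

# Errors: independent draws, normally distributed, mean zero, constant variance
epsilon = rng.normal(loc=0.0, scale=2.0, size=n)

# Linearity: y is a straight-line function of x plus the error term
y = alpha + beta * x + epsilon
```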

What is required to do OLS?

\(y = \alpha + \beta X + \epsilon\)

  1. Linearity - X and Y have a linear relationship.

What is required to do OLS?

The Assumptions of Linear Regression

\(y = \alpha + \beta X + \epsilon\)

  1. Linearity - X and Y have a linear relationship.

  2. Normality - For any value of X, Y is normally distributed.

      + We're in a random world
      + So, X won't predict Y with precision
      + X should predict Y according to a random, normal distribution
      + The residuals are normally distributed
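
One informal way to check this on data is to fit the line and look at the residuals. A minimal sketch (simulated data, arbitrary seed; scipy is assumed to be available):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)                    # arbitrary seed
x = rng.uniform(0, 10, size=100)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=100)    # simulated: assumptions hold

# Fit the line, then inspect the residuals
slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Shapiro-Wilk test: a large p-value is consistent with normal residuals
stat, p_value = stats.shapiro(residuals)
print(p_value)
```

A histogram or Q-Q plot of the residuals tells the same story visually.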

The Assumptions of Linear Regression

\(y = \alpha + \beta X + \epsilon\)

  1. Linearity - X and Y have a linear relationship.
  2. Normality - For any value of X, Y is normally distributed.
  3. Independence - The observations are independent of each other.

The Assumptions of Linear Regression

\(y = \alpha + \beta X + \epsilon\)

  1. Linearity - X and Y have a linear relationship.

  2. Normality - The errors are normally distributed.

  3. Independence - The observations are independent of each other.

  4. Homoskedasticity - The variance of the residual (\(\epsilon\)) is constant.

      + The variance of the error term is the same for any value of X as for any other
      + Assumption 2 told us the errors are normally distributed; the variance of that distribution does not depend on the value of X
      + The opposite of homoskedasticity is heteroskedasticity, which reduces the precision of our estimates
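
A rough way to see homoskedasticity (or its absence) is to compare the spread of the residuals at low and high values of X. A minimal sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)                    # arbitrary seed
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + rng.normal(0, 1, size=200)    # constant error variance

slope, intercept = np.polyfit(x, y, 1)
residuals = y - (intercept + slope * x)

# Under homoskedasticity the residual spread is similar across the range of X
low_spread = residuals[x < 5].std()
high_spread = residuals[x >= 5].std()
print(low_spread, high_spread)
```

Plotting the residuals against X is the usual visual check: a funnel shape suggests heteroskedasticity.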

Assumptions of Linear Regression

\(y = \alpha + \beta X + \epsilon\)

  1. Linearity - X and Y have a linear relationship.
  2. Normality - The errors are normally distributed.
  3. Independence - The observations are independent of each other.
  4. Homoskedasticity - The variance of the residual (\(\epsilon\)) is constant.

You may hear about the Gaussian assumptions of OLS

  • Gaussian is another word for the normal distribution
  • The normality assumption and homoskedasticity assumption aren’t necessary to fit a line
  • The normality assumption is necessary to prove that the OLS method is the most efficient unbiased estimator of the line
  • This has been mathematically proven as well as confirmed by simulation
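
The simulation point can be illustrated with a tiny Monte Carlo sketch (not the course's own simulation; true values and seed here are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2023)        # arbitrary seed
true_alpha, true_beta = 2.0, 0.5
slopes = []

for _ in range(2000):                    # 2000 simulated data sets
    x = rng.uniform(0, 10, size=50)
    y = true_alpha + true_beta * x + rng.normal(0, 1, size=50)
    slope, _ = np.polyfit(x, y, 1)
    slopes.append(slope)

# Across repeated samples the OLS slope estimates center on the true slope
print(np.mean(slopes))                   # very close to 0.5
```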

What if the assumptions are violated?

  • Linearity - We can’t sensibly fit a straight line without first transforming the variables (a small sketch of one such transform follows this list).

  • Independence - We have to account for whatever is causing the lack of independence.

  • Homoskedasticity - The precision of the estimates decreases.

  • Normality - The statistical tests are called into question.

  • These are all fixable in many cases, some fairly simply.
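
As one example of a simple fix (an illustration with made-up data, not a general recipe): if Y is actually linear in log(X) rather than in X, transforming X restores the linearity assumption:

```python
import numpy as np

rng = np.random.default_rng(7)                            # arbitrary seed
x = rng.uniform(1, 100, size=200)
y = 3.0 + 2.0 * np.log(x) + rng.normal(0, 1, size=200)    # linear in log(x), not x

# Regressing y on log(x) instead of x makes the relationship linear again
slope, intercept = np.polyfit(np.log(x), y, 1)
print(intercept, slope)                                   # close to 3.0 and 2.0
```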

Regression Formula

[Image: regression formula]

a is the formula to find the intercept and b is the formula to find the slope.
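
The figure itself isn’t reproduced here; one standard way to write the least squares slope and intercept (equivalent to the computational form the figure likely shows) is:

\[b = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x}\]

Here b estimates the slope \(\beta\) and a estimates the intercept \(\alpha\) in our model; the intercept formula just forces the line through the point \((\bar{x}, \bar{y})\).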

DON’T PANIC!

Alternate Method: From Correlation Coefficient

There is a “simpler” way to find a regression line that uses the correlation coefficient. But if you had to find the correlation coefficient by hand, you’d have to use this formula:

[Image: correlation coefficient formula]
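
The figure isn’t reproduced here; one standard way to write Pearson’s correlation coefficient (the figure may use the computational form with sums of x, y, and xy instead) is:

\[r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}\]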


DON’T PANIC!

Don’t Panic

I’m not asking you to do any of that, but…

It is worth looking at the formulas all together to see some of the relationships between them:

Don’t Panic

[Image: regression formula]

[Image: correlation coefficient formula]
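
One way to see the connection between the two formulas: the least squares slope can be written in terms of the correlation coefficient and the sample standard deviations of X and Y,

\[b = r\,\frac{s_y}{s_x}, \qquad a = \bar{y} - b\,\bar{x}\]

so the slope is just the correlation rescaled from standard-deviation units into the original units of X and Y.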

Authorship, License, Credits

Creative Commons License