Lecture 3

Correlation: Displacement vs Horsepower

Correlation: Displacement vs Miles per Gallon

Correlation: Rear Axle Ratio vs Quarter Mile Time

Correlation Example: nonlinear

OLS Regression

  • Our goal: to estimate the causal effect of some treatment
    • e.g. how does welfare reform impact poverty rates?
    • Does getting the flu vaccine improve your health?
    • Does increasing teacher salary improve student outcomes?
    • Do cats respond to cat music?
      • This is a real academic paper.

OLS Regression: Causal Relationships

  • Even simple causal models can have very complicated formulas. Think physics calculations
  • We need to simplify this model to get anywhere. We assume a linear relationship
  • e.g. student grades increase (or decrease) linearly with teacher salary
    • Going from $20,000/year to $21,000/year has the same effect on grades as going from $150,000/year to $151,000/year
    • We relax this assumption later

A Roadmap of What’s to Come

  • Start with mechanics of OLS and how to interpret
    • Focus is on describing data
  • Once we get to multivariate OLS we can control for issues we know cause bias
  • We finish by engineering specific controls to create “quasi-experiments”

OLS Regression: “The Core Model”

  • We arrive at what the book calls the core model, though you will not see this terminology elsewhere
    • It’s also called the regression equation or estimating equation
  • It is a generic formula that applies to any relationship between x and y:
  • \(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)

OLS Regression: Example

  • Suppose we use the teacher salary example: how do students’ test scores relate to teacher salaries on average?
  • We can rename our “core model”:
    • \(grade_i = \beta_0 + \beta_1 salary_i + \varepsilon_i\)
  • How do we interpret \(\beta_1\)? \(\beta_0\)?

OLS Regression: iClicker

  • \(grade_i = \beta_0 + \beta_1 salary_i + \varepsilon_i\)
  • Salary is in dollars. Grade is in GPA scale (1-4). Interpret \(\beta_1\)
  • A: A 1 unit increase in GPA is associated with a \(\beta_1\) dollar increase in teacher salary, on average
  • B: A 1 dollar increase in teacher salary is associated with a \(\beta_1\) unit increase in student GPA, on average
  • C: A \(\beta_1\) unit increase in GPA is associated with a 1 dollar increase in teacher salary, on average
  • D: A \(\beta_1\) dollar increase in teacher salary is associated with a 1 unit increase in student GPA, on average

OLS Regression: “The Core Model”

  • \(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)
    • \(y_i\) is our outcome (dependent) variable for individual i. Here it’s student i’s grade
    • \(\beta_0\) is the intercept. If we take our model seriously it’s the average grade for a student whose teacher has 0 salary
    • \(x_i\) is the independent variable for individual i. Here it is student i’s teacher’s salary
    • \(\beta_1\) is the slope. This is the actual (average) causal effect of increasing x by 1 unit on y

OLS Regression: “The Core Model”

  • \(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)
  • \(\varepsilon_i\) is the error term. It captures every factor not included in our model (which is a lot of things!)
    • Examples: student intelligence. Other student characteristics (e.g. demographics). Whether the student is interested in a subject. Whether the student slept in for an exam.
    • It has a mean value of 0

Sources of variation

  • Suppose we talk about “the” causal effect of increasing teacher salary on grades. Consider the following ways that teacher salary can increase:
    • A more qualified teacher is hired
    • A bonus is paid based on performance
    • Teachers with low pay are laid off and class sizes increase
    • There is a shortage of teachers, driving up salaries
    • Salaries must be increased due to bad working conditions
    • Hours are increased, so salary is also increased

What we’re measuring

  • We’re measuring the average association between the two variables in the data
    • We’re getting a mix of all of the possible reasons why different teachers have different salaries (and students have different grades)
    • The weightings are based on the population and sample we use
  • This is called a reduced form estimate.
  • Just describing this data is often useful

OLS Regression: Estimating the Core Model

  • We don’t know the true values of \(\beta_0, \beta_1\), so we need to estimate
  • We end up estimating \(y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat\varepsilon_i\)
  • Hats are used to indicate estimates. We know the actual values of x and y, but nothing else

OLS Questions

  • \(grade_i = \beta_0+\beta_1 salary_i + \varepsilon_i\)
    • \(salary_i\) is salary (in thousands of dollars), \(grade_i\) is final grade in percent (e.g. 100)
  • Suppose \(\beta_0=60, \beta_1=1\) how do we interpret this?
  • Student 1 has a teacher who is paid $20,000. Calculate \(\widehat{grade}_1\)
  • Student 1’s actual grade in class was a 65. What is \(\varepsilon_1\)?

OLS Questions

  • \(grade_i = \beta_0+\beta_1 salary_i + \varepsilon_i\)
    • \(salary_i\) is salary (in thousands of dollars), \(grade_i\) is final grade in percent (e.g. 100)
  • What is included in \(\varepsilon\)?
  • Suppose we estimate \(\hat\beta_0=40,\hat\beta_1=2\). What are \(\widehat{grade}_1\) and \(\hat\varepsilon_1\)?
  • What is included in \(\hat\varepsilon\) that is not included in \(\varepsilon\)?

OLS Regression: Some Observations

  • Core model: \(y_i=\beta_0+\beta_1 x_i +\varepsilon_i\)
  • Key question 1: If the model is this simplified, is it even useful?
  • From a predictive analytics perspective this is very weak. But as long as some simple assumptions are satisfied (covered later), this efficiently measures an average causal effect.
  • This is important for policy evaluation. If a union negotiates a salary increase what will happen to the average grade? This is the causal effect.

OLS Regression: Some Observations

  • Key Question 2: Given our model, how do we calculate \(\beta_1\) (and \(\beta_0\))?
  • We can never observe \(\beta_1\), but we can estimate it as \(\hat\beta_1\) using ordinary least squares regression
  • \(\hat\beta_1\) is a sample statistic (we’ll calculate later)
  • We then have to ask if \(\hat\beta_1\) is close to \(\beta_1\)

OLS regression: ideas

  • Our model is a line, and we have data. We estimate \(\beta_0,\beta_1\) by finding the best fit line to the observed data
  • We can measure the fit using sum of squared errors or mean squared error
  • \(SSE=\sum \varepsilon_i^2=\sum (y_i-\beta_0-\beta_1 x_i)^2\), \(MSE=\frac{SSE}{n}\)
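  • As a minimal sketch in R (with made-up data), we can compute the SSE and MSE for one candidate line:

set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50)   # simulated data roughly following y = 3 + 1.1x
b0 <- 3; b1 <- 1.1             # a candidate intercept and slope
e <- y - b0 - b1 * x           # error for each observation under this line
c(SSE = sum(e^2), MSE = mean(e^2))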

OLS Regression: Graph

OLS Regression: Fitting

  • Here we have a scatterplot of data, and the line \(y=3+1.1x\). For each point we can calculate the error term, square it, then take the average to get the mean squared error.
  • How do we know what the best fit line is?
  • Naive: for every possible \(\beta_0,\beta_1\) compute the MSE (or SSE), then choose the parameters that give the lowest value (best fit)
    • This is actually how many machine learning models work, but they use methods from calculus to make it fast
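  • A sketch of that naive approach, reusing the simulated data from the sketch above (the grid ranges are arbitrary):

grid <- expand.grid(b0 = seq(0, 6, by = 0.1), b1 = seq(0, 2, by = 0.05))
grid$MSE <- apply(grid, 1, function(p) mean((y - p[1] - p[2] * x)^2))
grid[which.min(grid$MSE), ]    # the (b0, b1) pair with the lowest MSE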

OLS Regression: Fitting

  • Why use mean squared error and not mean error?
    • Because positive and negative errors cancel: for any slope \(\beta_1\), we can choose \(\beta_0\) using simple algebra to make the mean error 0, so mean error cannot distinguish between candidate slopes

Question: Why SSE?

OLS Motivating Example: Student Scores

OLS Motivating Example: Student Scores


Call:
lm(formula = grade ~ attendance, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.514  -8.129   0.613   9.083  40.188 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept)   54.011      3.703  14.588 < 0.0000000000000002 ***
attendance    38.164      4.377   8.719 0.000000000000000941 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.85 on 206 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2695,    Adjusted R-squared:  0.266 
F-statistic: 76.02 on 1 and 206 DF,  p-value: 0.0000000000000009408

OLS Regression: How to Estimate

  • Core model: \(y_i=\beta_0+\beta_1 x_i +\varepsilon_i\)
  • Given our model, how do we calculate \(\beta_1\) (and \(\beta_0\))?
  • We can never observe \(\beta_1\), but we can estimate it as \(\hat\beta_1\) using ordinary least squares regression
  • We then have to ask if \(\hat\beta_1\) is close to \(\beta_1\)

OLS Regression: How to Estimate

  • Our model is a line, and we have data. We estimate \(\beta_0,\beta_1\) by finding the best fit line to the observed data
  • We can measure the fit using sum of squared errors or mean squared error
    • \(SSE=\sum \varepsilon_i^2=\sum (y_i-\beta_0-\beta_1 x_i)^2\), \(MSE=\frac{SSE}{n}\)

OLS Regression Estimates: Graphical

SSE: 195.1958 

OLS Regression Estimates: Graphical

  • What about now?
SSE: 171.4438 

OLS Regression Estimates: Graphical

We need to shift our estimate up so that the errors average to zero (\(\bar e=0\)):

mean(output[[2]]$e)
[1] 3.204127

OLS Regression Estimates: Graphical

  • Now?
SSE: 17.44736 

OLS Regression Estimates: Graphical

  • Now?!
SSE: 13.18756 

OLS Formula

  • We have a simple formula for \(\hat\beta_1\)
  • \(\hat\beta_1=\frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}\)
  • \(=\frac{E[(x-\bar x)(y-\bar y)]}{var(x)}\)
  • \(=\frac{cov(x,y)}{var(x)}\)
  • \(=\frac{\rho \sigma_y}{\sigma_x}\)
  • (Optional) Matrix notation formula: \(\hat\beta=(X'X)^{-1}X'Y\)
  • Note: \(\beta_0\) obtained through substitution: \(\hat\beta_0=\bar y -\hat\beta_1 \bar x\)
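  • A quick check of these formulas against R’s lm(), on simulated data with true \(\beta_0=2, \beta_1=3\):

set.seed(2)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)
b1_hat <- cov(x, y) / var(x)          # slope: cov(x,y)/var(x)
b0_hat <- mean(y) - b1_hat * mean(x)  # intercept by substitution
c(b0_hat, b1_hat)
coef(lm(y ~ x))                       # matches the closed-form estimates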

OLS Formula

  • Key question: is \(\hat\beta_1\approx\beta_1\)? i.e., as \(n\to\infty\) does \(\hat\beta_1\to\beta_1\)?
  • Seems unlikely. This is just correlation, yet we’re trying to measure a causal parameter. Correlation \(\neq\) Causation…

Properties of \(\hat\beta\)

  • Two important questions have not been addressed
  • What is the sampling distribution of \(\hat\beta\)? i.e., what are its variance and its distribution?
  • Under what conditions does \(\hat\beta\to\beta\)? Or, more generally, when is \(\hat\beta\) unbiased: \(E[\hat\beta]=\beta\)?
    • Bias is a systematic property.

Unbiasedness

  • \(\beta_1\) is a causal parameter, but \(\hat\beta_1\) is calculated using correlation. Under what conditions do these two equal each other on average, i.e. \(E[\hat\beta_1]=\beta_1\)?
  • First, let’s simulate an example. We have 1000 students who each have a randomly generated “ability” score (mean 0, sd 1)
  • They then have an attendance rate that is partially random and partially determined by ability.
  • Generate the score as \(10*ability + 5*attendance + \varepsilon\), \(\varepsilon \sim N(0,1)\)
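  • A sketch of this simulation (the exact weights linking ability and attendance are my choice, so the numbers below will not exactly match the slides):

set.seed(3)
n <- 1000
ability <- rnorm(n)                            # mean 0, sd 1
attendance <- 0.5 * ability + 0.5 * rnorm(n)   # partly ability, partly random
score <- 10 * ability + 5 * attendance + rnorm(n)
dt <- data.frame(ability, attendance, score)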

Unbiasedness

  • True Equation: \(score_i = 10*ability_i + 5*attendance_i + \varepsilon_i\)
  • \(cor(ability,attendance) > 0\)
  • Regression: \(score_i = \beta_0 + \beta_1 attendance_i + \varepsilon_i\)
  • What is \(\beta_1\)?
  • What is in the error term?
  • Is (A) \(E[\hat\beta_1]>\beta_1\), (B) \(E[\hat\beta_1]=\beta_1\), (C) \(E[\hat\beta_1]<\beta_1\), (D) Too little information

Unbiasedness

Unbiasedness


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
     -13.14        31.36  

Unbiasedness

  • \(\beta_1=5\), but \(\hat\beta_1=31.36\)!
  • Why?

Endogeneity Explained

  • Our problem is that ability is correlated with both score and attendance
cor(dt$ability,dt$attendance)
[1] 0.7001283
cor(dt$ability,dt$score)
[1] 0.9923924
  • Since we don’t observe ability in our simple model, it appears in our error term.
  • Our general condition is exogeneity: we need our error term to be uncorrelated with \(x\)

Endogeneity Explained

  • More generally, \(E[\hat\beta_1]=\beta_1 + cor(x,\varepsilon)\frac{\sigma_\varepsilon}{\sigma_x}\)
    • Our estimate equals the true parameter plus a bias term
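  • We can verify this numerically using the simulation sketch above, where the true error term is everything besides \(5*attendance\):

e <- score - 5 * attendance                       # true error: 10*ability + noise
5 + cor(attendance, e) * sd(e) / sd(attendance)   # bias formula, in sample
coef(lm(score ~ attendance, data = dt))[2]        # reproduces the estimated slope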

A second example

  • Suppose we have the same setup, but now attendance is independent of ability. Will \(E[\hat\beta_1]=5\)?
    • Same formula: \(score_i = 10*ability_i + 5*attendance_i + \varepsilon_i\)
    • Ability is still in our error term
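  • A sketch of this variant, where attendance no longer depends on ability:

set.seed(4)
n <- 1000
ability <- rnorm(n)
attendance <- rnorm(n)                 # now independent of ability
score <- 10 * ability + 5 * attendance + rnorm(n)
coef(lm(score ~ attendance))           # slope should now be near 5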

A second example

A second example


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
    -0.2459       5.2914  
  • Kinda! It’s close but not exact. What if we used more than 1000 students?

A second example

A second example


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
   -0.01396      5.02854  
  • Yes!

Endogeneity: arrows of causality

  • We can interpret these relationships graphically. If X and Y are related we can draw an arrow between them. In our core model X causes Y, and our error term causes Y as well.

Endogeneity: arrows of causality

Endogeneity: arrows of causality

A second example

  • In the first diagram, when attendance increases by 1, grade increases by \(\beta_1\). Grade is also affected by \(\varepsilon\), but when attendance increases, \(\varepsilon\) does not change on average
  • In the second diagram, when attendance increases by 1, grade directly increases by \(\beta_1\), but ability also tends to be higher (since attendance and ability are correlated), and ability affects grade too. In general, as long as \(x\) and \(\varepsilon\) are correlated we will have bias

Example: Simpson’s Paradox

Example: Simpson’s Paradox

Exogeneity: when do we have it?

  • Under what conditions do we have \(cor(x,\varepsilon)=0\)? [called exogenous variation]
  • Randomized control trials. If we randomly assign x, then other characteristics should average out across treatment and control groups
  • Quasi-Experiments: Differences-in-Differences, Instrumental Variables, and Regression Discontinuity most common

Quasi-Experiment: DID

  • DiD: How do we measure the effect of immigration on natives’ wages?
    • Cuba sends a large wave of migrants to the US (the Mariel Boatlift). Due to proximity, most settle in Miami. You measure the change in employment there and compare it to a very similar nearby region that did not receive this random wave of immigration.

Quasi-Experiment: IV

  • IV: How do we measure the effect of having a child on future wages?
    • IUDs are over 99% effective in preventing pregnancy. Use individuals whose birth control failed to estimate a causal effect of having a child on future wages

Quasi-Experiment: RD

  • RD: How do we determine the effect of passing a class on completing college?
    • An individual who received a 69.9% in a course and one who received a 70% are virtually identical, yet their outcomes are very different (fail vs pass). Use this to determine the causal effect of passing a course
  • These will be covered in more detail later in the course.
  • Another option is to use a multivariate regression with “control variables”, which is next chapter

Administrative Miscellanea

  • Homework 3 due Friday midnight
  • Quiz 2 today in class (closed note)
    • Regression focus
  • Exam 1 next Wednesday (math/stats review + bivariate OLS)
  • Class survey for extra credit (will send out an announcement once ready)
  • Next up: bivariate OLS, focusing on standard errors

iClicker

Which of the following statements are true concerning bias?

  • A: \(\hat\beta_1\) is unbiased if \(\hat\beta_1=\beta_1\)
  • B: \(\hat\beta_1\) is unbiased if \(E[\hat\beta_1]=\beta_1\)
  • C: \(\hat\beta_1\) is biased if there is a large error term
  • D: \(\hat\beta_1\) will be biased when outliers are present in the data
  • E: OLS regression produces unbiased estimates of \(\beta_1\)

Question

You are interested in how the credit rating of a firm (proxied by the interest paid on their bonds) affects the returns of their stock (in percentage points per year). What is your regression equation of interest?

iClicker

You are interested in how the credit rating of a firm (proxied by the interest paid on their bonds) affects the returns of their stock (in percentage points per year) by running \(return_i=\beta_0+\beta_1 credit_i + \varepsilon_i\). What would bias our estimate of \(\hat\beta_1\)?

  • Companies typically obtain good credit due to their superior operating and financial performance
  • Companies with bad credit pay more interest, reducing profits and increasing bankruptcy risk
  • On average, companies with bad credit operate in riskier industries which offer above average returns

iClicker

You are interested in how the credit rating of a firm (proxied by the interest paid on their bonds) affects the returns of their stock (in percentage points per year) by running \(return_i=\beta_0+\beta_1 credit_i + \varepsilon_i\). What would bias our estimate of \(\hat\beta_1\)?

  • Stock prices follow a random walk, meaning that stock returns are effectively randomly determined
  • Firms with bad credit receive less analyst coverage, decreasing average returns

Distribution of \(\hat\beta\)

  • As we saw before, our estimate of \(\hat\beta_1\) was unbiased in our second example, but even with 1000 students could be off by a substantial margin
  • In addition to the mean of \(\hat\beta_1\) (\(E[\hat\beta_1]\), which when unbiased is equal to \(\beta_1\)), we want to know the standard deviation
    • To avoid confusion with the standard deviation of other terms, we call this the standard error

Distribution of \(\hat\beta\)

  • In its expanded form \(\hat\beta_1=\frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}\)
    • This doesn’t need to be memorized, but note our estimate is built from averages of \(x_i\), \(y_i\), and \(x_i^2\) values
    • If our estimate is unbiased we have asymptotic normality from the central limit theorem
  • How do we find the standard error of \(\hat\beta_1\)?

Standard Error of \(\hat\beta_1\)

  • Directly, imagining estimates \(\hat\beta_{11},\hat\beta_{12},\dots\) from many repeated samples: \(var(\hat\beta_1)=\frac{(\hat\beta_{11}-\beta_1)^2 + (\hat\beta_{12}-\beta_1)^2 + ... + (\hat\beta_{1n}-\beta_1)^2}{n}\)
  • Substitute \(\hat\beta_{1i}=\frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}\)
  • You now have a giant formula; simplify! (You’re mostly separating out terms and recombining them into known forms)

Standard Error of \(\hat\beta_1\)

  • Result: \(var(\hat\beta_1)=\frac{\hat\sigma^2}{nVar(x)}\)
  • \(\hat\sigma^2\) is the variance of the regression, i.e. the variance around the fitted line
  • Recall \(y_i=\beta_0+\beta_1 x +\varepsilon_i\)
  • The variation around this line is \(var(\varepsilon_i)\) (\(=var(y_i-\hat y_i)\))
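  • We can check this against R’s reported standard error. One caveat: R uses \(n-2\) degrees of freedom and \(\sum(x_i-\bar x)^2\) rather than the large-sample \(n \cdot var(x)\). A sketch with simulated data:

set.seed(5)
x <- rnorm(200)
y <- 1 + 2 * x + rnorm(200, sd = 3)
fit <- lm(y ~ x)
sigma2_hat <- sum(resid(fit)^2) / (length(x) - 2)   # variance around the fitted line
sqrt(sigma2_hat / sum((x - mean(x))^2))             # manual SE of the slope
summary(fit)$coefficients["x", "Std. Error"]        # identical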

Standard Error of \(\hat\beta_1\)

  • Naturally we can also get standard errors directly from R (the standard deviation from before with n=1000 is 1.49, or 30% of the mean)
summary(lm(score ~ attendance, data = dt))

Call:
lm(formula = score ~ attendance, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-48.691  -6.781   0.009   6.786  47.478 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) -0.01396    0.02613  -0.534               0.593    
attendance   5.02854    0.04824 104.240 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.05 on 999998 degrees of freedom
Multiple R-squared:  0.01075,   Adjusted R-squared:  0.01075 
F-statistic: 1.087e+04 on 1 and 999998 DF,  p-value: < 0.00000000000000022

Standard Error Implications

  • The more random variation we have in our model (from \(\varepsilon\)), the noisier our estimate will be
  • The more observations we have the more precise our estimate
  • The more variation we have in our explanatory variable (independent variable) the more precise our estimate

A note on consistency

  • Consistency means that as we increase our number of observations, the thing we’re measuring (\(\hat\beta_1\)) approaches some value. OLS is a consistent estimator: it will eventually converge on some value
  • If \(\hat\beta_1\to\beta_1\) as \(n\to\infty\) we say that our estimator is asymptotically unbiased
  • Many common estimators are biased but asymptotically unbiased. Example: \(\hat\sigma^2\)

What assumptions are we making?

  • 1: Unbiasedness: \(cov(x,\varepsilon)=0\)

What assumptions are we making?

  • 2: When is \(var(\hat\beta_1)=\frac{\hat\sigma^2}{n\,var(x)}\) correct? We need:
  • Homoskedasticity: all errors have the same variance, \(var(\varepsilon_1)=var(\varepsilon_2)=...=var(\varepsilon_n)\)
  • Uncorrelated errors: \(cor(\varepsilon_i,\varepsilon_j)=0\) for all \(i \neq j\)
    • Autocorrelation in time series. Your income in 2024 is probably close to your income in 2023 (unless you’re graduating!)
    • Cluster correlation: all students in a specific classroom are subject to construction sound during an exam

What assumptions are we making?

  • Given unbiasedness, how do we know this is the best way to estimate \(\hat\beta\)?
  • Gauss-Markov Theorem: OLS is the minimum variance unbiased linear estimator. Required conditions are exactly the ones above
  • What if we remove the linear requirement? Then we need only add that error terms are normally distributed.

A note on Expectation

  • \(E[X]\) denotes the mean of \(X\): \(\mu_x\), estimated by \(\bar x\). The expectation of a constant is the constant itself: \(E[1]=1\). So once we have computed a specific number \(\hat\beta=1\), treating that realized value as fixed gives \(E[\hat\beta]=1\)

Homoskedasticity

  • One common form of heteroskedasticity occurs when errors are proportional to the size of the observation:

Homoskedasticity

  • For simple cases like this we can solve the issue by using log(y) instead of y, but we often have more complicated error structures
  • We can correct for this (asymptotically) using robust standard errors. The formula is horrifying, but it’s easy to implement in software
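  • For example, with the sandwich and lmtest packages installed (reusing the simulated dt from earlier):

library(sandwich)
library(lmtest)
fit <- lm(score ~ attendance, data = dt)
coeftest(fit, vcov = vcovHC(fit, type = "HC1"))   # heteroskedasticity-robust SEs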

Autocorrelation

  • Our data is autocorrelated if we can predict an observation using the previous one, i.e. \(cor(y_i,y_{i-1})\neq0\)
  • Time series data is often autocorrelated: if you have a stable job we can do a good job of predicting your income next year by using this year’s income as a base.
  • This is much more problematic, but sometimes we can solve it by differencing our model. Instead of using \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\) use \(\Delta y_i=\beta_0 + \beta_1 \Delta x_i + \varepsilon_i\)
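  • A sketch with made-up time-series data, where both x and the errors drift over time:

set.seed(6)
x <- cumsum(rnorm(100))               # a slowly drifting regressor
y <- 2 + 3 * x + cumsum(rnorm(100))   # errors are themselves autocorrelated
coef(lm(diff(y) ~ diff(x)))           # estimate using first differences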

Autocorrelation

  • We can assess autocorrelation using the autocorrelation function: acf in R. It gives \(cor(y_i,y_{i-j})\) for different lags \(j\) (example below)
  • Autocorrelation does not affect bias, only the precision of our estimates
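  • Continuing the time-series sketch above:

fit <- lm(y ~ x)     # the model in levels
acf(resid(fit))      # spikes far outside the bands indicate autocorrelation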

Clustering of errors

  • Our data can also be correlated if observations are subject to common shocks, e.g. all students in one classroom hear construction noise while taking an exam
  • Errors can be clustered in software to account for this. Again, the formula is fairly complex but simple to implement.
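  • The sandwich package also handles clustering. As a sketch, classroom here is a hypothetical grouping variable added to the simulated data:

library(sandwich)
library(lmtest)
dt$classroom <- sample(1:40, nrow(dt), replace = TRUE)    # hypothetical clusters
fit <- lm(score ~ attendance, data = dt)
coeftest(fit, vcov = vcovCL(fit, cluster = dt$classroom)) # cluster-robust SEs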

Assessing Goodness of Fit

  • Whether a model is a good fit, or practically significant, is highly context dependent. There are some measures that may be helpful
  • \(\hat\sigma\), the standard error of the regression or root mean squared error (RMSE), gives the variation around the line of best fit
  • Models can be compared to each other to assess predictive power, but if you just tell me the RMSE of a model I really can’t say anything about whether it’s a good fit

Assessing Goodness of Fit

  • \(R^2\) (literally \(r^2\), where \(r\) is our coefficient of correlation) gives the percent of variation in y explained by x (checked numerically below)
  • \(R^2\) is unitless and between 0 and 1
  • Hard to compare across contexts. \(R^2=0.3\) might be low if comparing GDP across countries, while \(R^2=0.01\) might be high if using individual data for insurance claims
  • In general a plot of the data is always a good idea
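  • The claim that \(R^2=r^2\) is easy to check in the bivariate case:

set.seed(7)
x <- rnorm(100)
y <- 1 + 0.5 * x + rnorm(100)
cor(x, y)^2                   # squared correlation coefficient
summary(lm(y ~ x))$r.squared  # the same number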

Highly influential observations

  • OLS regression can be highly sensitive to outliers, particularly if both the x and y values are extreme
  • The results are still unbiased with outliers: we often do nothing about them
  • You can assess how influential a point is by removing it from the data and seeing how \(\hat\beta_1\) changes (sketched after this list). The more data you have, the less likely outliers are to be a major issue
    • We will explore this more in problem set 2
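  • A sketch of that leave-one-out check, with one planted extreme point:

set.seed(8)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)
x[30] <- 10; y[30] <- -10     # plant an outlier extreme in both x and y
b1_full <- coef(lm(y ~ x))[2]
b1_drop <- sapply(seq_along(x), function(i) coef(lm(y[-i] ~ x[-i]))[2])
which.max(abs(b1_drop - b1_full))   # observation 30 moves the slope the most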