Lecture 3

Administrative Miscellanea

Quiz Today During Class
Homework 2 due Friday Midnight
Problem Set 1 Due Next Friday

Correlation: Displacement vs Horsepower

Correlation: Answer

[1] 0.7909486

Correlation: Displacement vs Miles per Gallon

Correlation: Answer

[1] -0.8475514

Correlation: Rear Axle Ratio vs Quarter Mile Time

Correlation: Answer

[1] 0.09120476
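
The three correlations above appear to match R's built-in mtcars data set. That is an assumption, since the slides do not name the data. A minimal sketch under that assumption:

```r
# Assumption: the slides use R's built-in mtcars data (not stated in the source).
data(mtcars)

cor(mtcars$disp, mtcars$hp)    # displacement vs horsepower
cor(mtcars$disp, mtcars$mpg)   # displacement vs miles per gallon
cor(mtcars$drat, mtcars$qsec)  # rear axle ratio vs quarter mile time
```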

Correlation Example: nonlinear

Correlation Example: nonlinear, answer

[1] 1.239961e-16
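
A correlation this close to zero can coexist with a perfect nonlinear relationship. The sketch below uses hypothetical data (not the data behind the slide): y is fully determined by x, yet the linear correlation is essentially zero.

```r
# Hypothetical data: y depends perfectly on x, but not linearly.
x <- -10:10
y <- x^2

r <- cor(x, y)   # essentially zero: there is no linear association
print(r)
```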

OLS Regression

Our goal: to estimate the causal effect of some treatment
- e.g. how does welfare reform impact poverty rates?
- Does getting the flu vaccine improve your health?
- Does increasing teacher salary improve student outcomes?
- Do cats respond to cat music?
  - This is a real academic paper.

OLS Regression: Causal Relationships

Even simple causal models can have very complicated formulas. Think physics calculations
We need to simplify this model to get anywhere. We assume a linear relationship
e.g. student grades increase (or decrease) linearly with teacher salary
- Going from $20,000/year to $21,000/year has the same effect on grades as going from $150,000/year to $151,000/year
- This assumption can be relaxed later by introducing model transformations

A Roadmap of What’s to Come

Start with the mechanics of OLS, and hand-wave most of the tricky parts regarding causality
Once we get to multivariate OLS we can try to address issues of bias
When we get to Potential Outcomes we can focus on intricacies surrounding causal effects
At the end we conclude with basic research designs

OLS Regression: “The Core Model”

We arrive at what the book calls the core model, though you will not see this terminology elsewhere
It is a generic formula that applies to any relationship between x and y:
$y_i=\beta_0 + \beta_1 x_i + \varepsilon_i$

OLS Regression: “The Core Model”

$y_i=\beta_0 + \beta_1 x_i + \varepsilon_i$
- $y_i$ is our outcome (dependent) variable for individual i. Here it’s student i’s grade
- $\beta_0$ is the intercept. If we take our model seriously it’s the average grade for a student whose teacher has a salary of 0
- $x_i$ is the independent variable for individual i. Here it is student i’s teacher’s salary
- $\beta_1$ is the slope. This is the actual (average) causal effect of increasing x by 1 on y

OLS Regression: “The Core Model”

$y_i=\beta_0 + \beta_1 x_i + \varepsilon_i$
$\varepsilon_i$ is the error term. It captures every factor not included in our model (which is a lot of things!)
- Examples: student intelligence. Other student characteristics (e.g. demographics). Whether the student is interested in a subject. Whether the student slept in for an exam.
- It has mean value of 0

OLS Regression: Estimating the Core Model

We don’t know the true values of $\beta_0, \beta_1$, so we need to estimate
We end up estimating $y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat\varepsilon_i$
Hats are used to indicate estimates. We know the actual value of x and y, but not anything else

OLS Questions

$grade_i = \beta_0+\beta_1 salary_i + \varepsilon_i$
- $salary_i$ is salary (in thousands of dollars), $grade_i$ is final grade in percent (e.g. 100)

Suppose $\beta_0=60, \beta_1=1$. How do we interpret this?
Student 1 has a teacher who is paid $20,000. Calculate $\widehat{grade}_1$
Student 1’s actual grade in class was a 65. What is $\varepsilon_1$?

OLS Questions

$grade_i = \beta_0+\beta_1 salary_i + \varepsilon_i$
- $salary_i$ is salary (in thousands of dollars), $grade_i$ is final grade in percent (e.g. 100)

What is included in $\varepsilon$?
Suppose we estimate $\hat\beta_0=40,\hat\beta_1=2$. What are $\widehat{grade}_1$ and $\hat\varepsilon_1$?
What is included in $\hat\varepsilon$ that is not included in $\varepsilon$?
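
The arithmetic behind these questions can be checked with a few lines of R. The variable names are illustrative, and salary is entered in thousands of dollars as the model specifies.

```r
salary_1 <- 20   # $20,000, measured in thousands of dollars
grade_1  <- 65   # student 1's actual grade

# True parameters from the first set of questions
beta0 <- 60; beta1 <- 1
grade_true_line <- beta0 + beta1 * salary_1   # 80
epsilon_1 <- grade_1 - grade_true_line        # -15

# Estimated parameters from the second set of questions
b0_hat <- 40; b1_hat <- 2
grade_hat_1 <- b0_hat + b1_hat * salary_1     # 80
resid_1 <- grade_1 - grade_hat_1              # -15

c(epsilon_1 = epsilon_1, resid_1 = resid_1)
```

With these particular numbers the true line and the estimated line happen to give the same fitted value at a salary of 20, so the error and the residual coincide for student 1. At any other salary they would differ.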

OLS Regression: Some Observations

Core model: $y_i=\beta_0+\beta_1 x_i +\varepsilon_i$

Key question 1: If the model is this simplified, is it even useful?
From a predictive analytics perspective this is very weak. But as long as some simple assumptions are satisfied (covered later) this efficiently measures an average causal effect.
This is important for policy evaluation. If a union negotiates a salary increase what will happen to the average grade? This is the causal effect.

OLS Regression: Some Observations

Key Question 2: Given our model, how do we calculate $\beta_1$ (and $\beta_0$)?
We can never observe $\beta_1$, but we can estimate it as $\hat\beta_1$ using ordinary least squares regression
$\hat\beta_1$ is a sample statistic (we’ll calculate later)
We then have to ask if $\hat\beta_1$ is close to $\beta_1$

OLS regression: ideas

Our model is a line, and we have data. We estimate $\beta_0,\beta_1$ by finding the best fit line to the observed data
We can measure the fit using sum of squared errors or mean squared error
$SSE=\sum \varepsilon_i^2=\sum (y_i-\beta_0-\beta_1 x_i)^2$, $MSE=SSE/n$

OLS Regression: Graph

OLS Regression: Fitting

Here we have a scatterplot of data, and the line $y=3+1.1x$. For each point we can calculate the error term, square it, then take the average to get the mean squared error.
How do we know what the best fit line is?
Naive: for every possible $\beta_0,\beta_1$ compute the MSE (or SSE), then choose the parameters that give the lowest value (best fit)
- This is actually how many learning models work, but they use methods from calculus to make it fast
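
The naive search can be sketched directly. The data here are hypothetical (simulated around the line $y=3+1.1x$), and the grid bounds and step sizes are arbitrary choices for illustration.

```r
set.seed(1)
# Hypothetical data scattered around the line y = 3 + 1.1x
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50)

sse <- function(b0, b1) sum((y - b0 - b1 * x)^2)

# Naive search: try every pair on a coarse grid and keep the smallest SSE
grid <- expand.grid(b0 = seq(0, 6, by = 0.05), b1 = seq(0, 2, by = 0.01))
grid$sse <- mapply(sse, grid$b0, grid$b1)
best <- grid[which.min(grid$sse), ]

fit <- lm(y ~ x)
best        # grid-search estimate
coef(fit)   # exact least-squares answer for comparison
```

The grid can only get as close as its step size allows, so its SSE is never below the SSE at the lm() coefficients.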

OLS Regression: Fitting

Why use mean square error and not mean error?
- Note that for any $\beta_1$, we can calculate $\beta_0$ using simple algebra to make the mean error 0

Question: Why SSE?

OLS Motivating Example: Student Scores


Call:
lm(formula = grade ~ attendance, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.514  -8.129   0.613   9.083  40.188 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.011      3.703  14.588  < 2e-16 ***
attendance    38.164      4.377   8.719 9.41e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.85 on 206 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2695,    Adjusted R-squared:  0.266 
F-statistic: 76.02 on 1 and 206 DF,  p-value: 9.408e-16

OLS Regression: How to Estimate

Core model: $y_i=\beta_0+\beta_1 x_i +\varepsilon_i$

Given our model, how do we calculate $\beta_1$ (and $\beta_0$)?
We can never observe $\beta_1$, but we can estimate it as $\hat\beta_1$ using ordinary least squares regression
We then have to ask if $\hat\beta_1$ is close to $\beta_1$

OLS Regression: How to Estimate

Our model is a line, and we have data. We estimate $\beta_0,\beta_1$ by finding the best fit line to the observed data
We can measure the fit using sum of squared errors or mean squared error
$SSE=\sum \varepsilon_i^2=\sum (y_i-\beta_0-\beta_1 x_i)^2$, $MSE=SSE/n$

OLS Regression Estimates: Graphical

SSE: 165.8958

OLS Regression Estimates: Graphical

What about now?

SSE: 164.6405

OLS Regression Estimates: Graphical

We need to shift our estimate up so that it’s halfway between the data ($\bar e=0$):

mean(output[[2]]$e)

[1] 3.184145

OLS Regression Estimates: Graphical

Now?

SSE: 12.55881

OLS Regression Estimates: Graphical

Now?!

SSE: 11.86442

OLS Formula

Our OLS model is simple enough that we have a simple formula for $\hat\beta_1$
$\hat\beta_1=\sum(x_i-\bar x)(y_i-\bar y)/(\sum(x_i-\bar x)^2)$
$=E[(x-\bar x)(y-\bar y)]/var(x)$
$=cov(x,y)/var(x)$
$=\rho \sigma_y/\sigma_x$
Note: $\beta_0$ obtained through substitution: $\hat\beta_0=\bar y -\hat\beta_1 \bar x$
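
The equivalent forms of the formula can be verified numerically. This is a sketch on hypothetical simulated data; the three versions of the slope should agree with each other and with lm().

```r
set.seed(3)
x <- rnorm(100)
y <- 2 + 0.5 * x + rnorm(100)   # hypothetical data

b1_sums <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1_cov  <- cov(x, y) / var(x)
b1_cor  <- cor(x, y) * sd(y) / sd(x)
b0      <- mean(y) - b1_sums * mean(x)   # intercept by substitution

fit <- lm(y ~ x)
c(b0 = b0, b1_sums = b1_sums, b1_cov = b1_cov, b1_cor = b1_cor)
coef(fit)
```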

OLS Formula

Key question: is $\hat\beta_1\approx\beta_1$? I.e. as $n\to\infty$ does $\hat\beta_1\to\beta_1$?
Seems unlikely. This is just correlation, yet we’re trying to measure a causal parameter. Correlation $\neq$ Causation…

Properties of $\hat\beta$

Two important questions have not been addressed
What is the sampling distribution of $\hat\beta$? I.e. what is its variance, and what shape does its distribution take?
Under what conditions does $\hat\beta\to\beta$? Or more generally, when is $\hat\beta$ unbiased: $E[\hat\beta]=\beta$?

Unbiasedness

$\beta_1$ is a causal parameter, but $\hat\beta_1$ is calculated using correlation. Under what conditions do these two equal each other on average, i.e. $E[\hat\beta_1]=\beta_1$?
First, let’s simulate a counterexample. We have 1000 students who each have a randomly generated “ability” score (mean 0, sd 1)
They then have an attendance rate that is partially random and partially determined by ability.
Finally we generate the score as $10*ability + 5*attendance + \varepsilon$ where $\varepsilon$ is a random variable with mean 0 and sd 1

Unbiasedness

Here $\beta_1=5$: this is our causal parameter. Our $10*ability$ term is not specified in our simple regression so it is included in the error term of our model
What will the result be?

Unbiasedness


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
     -12.91        30.88

Unbiasedness

$\beta_1=5$, but $\hat\beta_1\approx 31$!
Why?

Endogeneity Explained

Our problem is that ability is correlated with both score and attendance

cor(dt$ability,dt$attendance)

[1] 0.6945407

cor(dt$ability,dt$score)

[1] 0.9924159

Since we don’t observe ability in our simple model, it appears in our error term. Our general condition is exogeneity: we need our error term to be uncorrelated with $x$

Endogeneity Explained

More generally, $E[\hat\beta_1]=\beta_1 + cor(x,\varepsilon)*\sigma_\varepsilon/\sigma_x$
- Our estimate equals the true parameter plus some bias term
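
The bias term can be seen in a simulation. This sketch follows the setup described earlier, but the way attendance is generated is an assumption (the slides do not give it), so the numbers will not match the regression output shown above.

```r
set.seed(4)
n <- 1000
ability    <- rnorm(n)                              # mean 0, sd 1
attendance <- 0.5 * ability + rnorm(n, sd = 0.5)    # assumed form: partly ability, partly random
eps        <- rnorm(n)
score      <- 10 * ability + 5 * attendance + eps

slope <- unname(coef(lm(score ~ attendance))[2])

# Everything omitted from the regression lands in the error term
e    <- 10 * ability + eps
bias <- cor(attendance, e) * sd(e) / sd(attendance)

slope      # well above the causal effect of 5
5 + bias   # true effect plus the bias term
```

Within a sample the two printed numbers agree exactly, because the OLS slope is $cov(x,y)/var(x)$ and $y$ is $5x$ plus the error.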

A second example

Suppose we have the same setup, but now attendance is independent of ability. Will $\hat\beta_1=5$?

A second example


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
    -0.9932       7.2068

Kinda! It’s close but not exact. What if we used more than 1000 students?

A second example


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
     0.0127       4.9690

yes!

Endogeneity: arrows of causality

We can interpret these graphically. If X and Y are related we can draw an arrow between them. In our core model X causes Y, and our error term causes Y as well.

Endogeneity: arrows of causality

A second example

In the first I increase attendance by 1, and grade increases by $\beta_1$. Grade is also affected by $e$, but when I increase attendance $e$ does not change, on average
In the second I increase attendance by 1, and grade directly increases by $\beta_1$, but ability also changes based on $cor(attendance,ability)$, and grade changes again based on $cor(ability,grade)$. In general, as long as x and e are correlated we will have bias

Example: Simpson’s paradox

Exogeneity: when do we have it?

Under what conditions do we have $cor(x,\varepsilon)=0$? [called exogenous variation]
Randomized controlled trials. If we randomly assign x then any other characteristics should average out across treatment and control
Quasi-Experiments: Differences-in-Differences, Instrumental Variables, and Regression Discontinuity most common

Quasi-Experiment: DID

DiD: How do we measure the effect of immigration on natives’ wages?
- Cuba sends a large wave of migrants to the US. Due to proximity, most settle in Miami. You measure the change in employment and compare it to a very similar nearby region that did not have this random wave of immigration.

Quasi-Experiment: IV

IV: How do we measure the effect of having a child on future wages?
- IUDs are over 99% effective in preventing pregnancy. Use individuals whose birth control failed to estimate a causal effect of having a child on future wages

Quasi-Experiment: RD

RD: How do we determine the effect of passing a class on completing college?
- Individuals who received a 69.9% and a 70% in a course are virtually identical, yet their outcomes are very different (fail vs pass). Use this to determine the causal effect of passing a course
These will be covered in more detail later in the course.
Another option is to use a multivariate regression with “control variables”, which is next chapter

Distribution of $\hat\beta$

As we saw before, our estimator $\hat\beta_1$ was unbiased in our second example, but even with 1000 students the estimate could be off by a substantial margin
In addition to the mean of $\hat\beta_1$ ($E[\hat\beta_1]$, which when unbiased is equal to $\beta_1$), we want to know the standard deviation
- To avoid confusion with the standard deviation of other terms, we call this the standard error

Distribution of $\hat\beta$

In its expanded form $\hat\beta_1=\sum(x_i-\bar x)(y_i-\bar y)/(\sum(x_i-\bar x)^2)$
- This doesn’t need to be memorized, but our estimate is just an average of $x_i, y_i$, and $x_i^2$ values
- If our estimate is unbiased we have asymptotic normality from the central limit theorem
How do we find the standard error of $\hat\beta_1$?

Standard Error of $\hat\beta_1$

Directly: $var(\hat\beta_1)=(\hat\beta_{11}-\beta_1)^2/n + (\hat\beta_{12}-\beta_1)^2/n + ... + (\hat\beta_{1n}-\beta_1)^2/n$
Substitute $\hat\beta_{1j}=\sum(x_i-\bar x)(y_i-\bar y)/(\sum(x_i-\bar x)^2)$ for each sample $j$
You now have a giant formula, simplify! (you’re mostly separating out terms and recombining into known forms)

Standard Error of $\hat\beta_1$

Result: $var(\hat\beta_1)=\hat\sigma^2/(nVar(x))$
$\hat\sigma^2$ estimates $\sigma^2$, the variance of the regression, i.e. the variance around the line of fit
Recall $y_i=\beta_0+\beta_1 x_i +\varepsilon_i$
The variation around this line is $var(\varepsilon_i)$, estimated by $var(y_i-\hat y_i)$
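
The variance formula can be checked against the standard error R reports. This sketch uses hypothetical simulated data and takes $\hat\sigma^2$ as the residual sum of squares divided by $n-2$, which is what lm() uses. Note that $n\,Var(x)$ with the $1/n$ variance is just $\sum(x_i-\bar x)^2$.

```r
set.seed(5)
n <- 1000
x <- rnorm(n)
y <- 1 + 5 * x + rnorm(n, sd = 10)   # hypothetical data

fit <- lm(y ~ x)
sigma2_hat <- sum(resid(fit)^2) / (n - 2)       # estimated variance around the line
var_b1 <- sigma2_hat / sum((x - mean(x))^2)     # denominator equals n * Var(x)
se_b1  <- sqrt(var_b1)

se_b1
summary(fit)$coefficients["x", "Std. Error"]
```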

Standard Error of $\hat\beta_1$

Naturally we can also get standard errors directly from R (the standard deviation from before with n=1000 is 1.49, or 30% of the mean)

summary(lm(data=dt,score ~ attendance))


Call:
lm(formula = score ~ attendance, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.743  -6.771  -0.023   6.781  53.516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01270    0.02610   0.487    0.627    
attendance   4.96901    0.04818 103.128   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.04 on 999998 degrees of freedom
Multiple R-squared:  0.01052,   Adjusted R-squared:  0.01052 
F-statistic: 1.064e+04 on 1 and 999998 DF,  p-value: < 2.2e-16

Standard Error Implications

The more random variation we have in our model (from $\varepsilon$), the noisier our prediction will be
The more observations we have the more precise our estimate
The more variation we have in our explanatory variable (independent variable) the more precise our estimate

A note on consistency

Consistency means that as we increase our number of observations, the thing we’re measuring ($\hat\beta_1$) will approach some value. OLS is a consistent estimator: we will eventually converge on some value
If $\hat\beta_1\to\beta_1$ as $n\to\infty$ we say that our estimator is asymptotically unbiased
Many common estimators are biased but asymptotically unbiased. Example: $\hat\sigma^2$

What assumptions are we making?

1: Unbiasedness: $cov(x,\varepsilon)=0$

What assumptions are we making?

2: Variance formula $var(\hat\beta_1)=\hat\sigma^2/(nVar(x))$ (the standard error is its square root)
Homoskedasticity: all errors have the same variance. $var(\varepsilon_1)=Var(\varepsilon_2)=...=Var(\varepsilon_n)$
Uncorrelated errors: $cor(\varepsilon_i,\varepsilon_j)=0$ for all $i\neq j$
- Autocorrelation in time series. Your income in 2024 is probably close to your income in 2023 (unless you’re graduating!)
- Cluster correlation: all students in a specific classroom are subject to construction sound during an exam

What assumptions are we making?

Given unbiasedness, how do we know this is the best way to estimate $\beta$?
Gauss-Markov Theorem: OLS is the minimum variance unbiased linear estimator. Required conditions are exactly the ones above
What if we remove the linear requirement? Then we need only add that error terms are normally distributed.

A note on Expectation

$E[X]$ refers to $\bar x$ or $\mu_x$. If we obtain $\hat\beta=1$ it follows that $E[\hat\beta]=1$. Similarly $E[1]=1$

Homoskedasticity

One common form of heteroskedasticity occurs when errors are proportional to the size of the observation:

Homoskedasticity

For simple cases like this we can solve the issue by using log(y) instead of y, but we often have more complicated error structures
We can correct for this (asymptotically) using robust standard errors. The formula is horrifying, but it’s easy to implement in software
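
For a single regressor, one common version of the robust variance (White's HC0, chosen here as an assumption since the slides do not say which variant they mean) reduces to a short expression. The sketch below computes it by hand on hypothetical data whose error spread grows with x.

```r
set.seed(6)
n <- 500
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # hypothetical: error sd is proportional to x

fit <- lm(y ~ x)
e   <- resid(fit)
sxx <- sum((x - mean(x))^2)

se_classical <- summary(fit)$coefficients["x", "Std. Error"]
# HC0 for the slope: each squared residual is weighted by its own (x - mean)^2
se_robust <- sqrt(sum((x - mean(x))^2 * e^2) / sxx^2)

c(classical = se_classical, robust_HC0 = se_robust)
```

In practice this is usually done with a package rather than by hand; the manual version is only meant to show what the correction does.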

Autocorrelation

Our data is autocorrelated if we can predict an observation using the previous one, ie $cor(y_i,y_{i-1})\neq0$
Time series data is often autocorrelated: if you have a stable job we can do a good job of predicting your income next year by using this year’s income as a base.
This is much more problematic, but sometimes we can solve it by differencing our model. Instead of using $y_i=\beta_0+\beta_1x_i+\varepsilon_i$ use $\Delta y_i=\beta_0 + \beta_1 \Delta x_i + \varepsilon_i$

Autocorrelation

We can assess autocorrelation using the autocorrelation function: acf in R. It gives $cor(y_i,y_{i-j})$ for different values of j (lags)
Autocorrelation does not affect bias, only the precision of our estimates
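
A minimal sketch of acf on a hypothetical series in which each value is built from the previous one (the coefficient 0.8 is an arbitrary choice):

```r
set.seed(7)
n <- 1000
y <- numeric(n)
for (t in 2:n) y[t] <- 0.8 * y[t - 1] + rnorm(1)   # hypothetical autocorrelated series

a <- acf(y, lag.max = 5, plot = FALSE)
round(as.numeric(a$acf), 2)   # first entry is lag 0, which is always 1
```

Dropping plot = FALSE draws the usual correlogram instead of returning only the numbers.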

Clustering of errors

Our data can also be correlated if they’re subject to common shocks. E.g. all students in one classroom have construction outside while taking an exam
Errors can be clustered in software to account for this. Again, the formula is fairly complex but simple to implement.

Assessing Goodness of Fit

Whether a model is a good fit, or practically significant, is highly context dependent. There are some measures that may be helpful
$\hat\sigma$, the standard error of the regression or root mean square error (rMSE) gives the variation around the line of best fit
Models can be compared to each other to assess predictive power, but if you just tell me the rMSE of a model I really can’t say anything about whether it’s a good fit

Assessing Goodness of Fit

$R^2$ (literally $r^2$ where r is our coefficient of correlation) gives the percent of variation in y explained by x.
$R^2$ is unitless and between 0 and 1
Hard to compare across contexts. $R^2=0.3$ might be low if comparing GDP across countries, while $R^2=0.01$ might be high if using individual data for insurance claims
In general a plot of the data is always a good idea
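
The link between $R^2$ and the correlation coefficient in a one-variable regression can be confirmed directly. R's built-in mtcars data is used here purely as an illustration.

```r
fit <- lm(mpg ~ disp, data = mtcars)   # built-in data, illustration only
r   <- cor(mtcars$disp, mtcars$mpg)

summary(fit)$r.squared   # R^2 reported by the regression
r^2                      # squared correlation: the same number
```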

Highly influential observations

OLS regression can be highly sensitive to outliers, particularly if both the x and y values are extreme
The results are still unbiased with outliers: we often do nothing about them
You can assess how influential a point is by removing it from the data and seeing how $\hat\beta_1$ changes. The more data you have the less likely outliers are to be a major issue
- We will explore this more in problem set 1
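
The remove-and-refit check can be sketched as follows. The data and the outlier are hypothetical, with the extra point made extreme in both x and y.

```r
set.seed(8)
x <- rnorm(30)
y <- 1 + 2 * x + rnorm(30)   # hypothetical data; true slope is 2

# Add one hypothetical outlier that is extreme in both x and y
x_out <- c(x, 10)
y_out <- c(y, -20)

b1_with    <- unname(coef(lm(y_out ~ x_out))[2])
b1_without <- unname(coef(lm(y ~ x))[2])
c(with_outlier = b1_with, without_outlier = b1_without)
```

A single point is enough to drag the slope well below the value estimated from the other 30 observations.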
