Lecture 3

Administrative Miscellanea

  • Quiz Today During Class
  • Homework 2 due Friday Midnight
  • Problem Set 1 Due Next Friday

Correlation: Displacement vs Horsepower

Correlation: Answer

[1] 0.7909486

Correlation: Displacement vs Miles per Gallon

Correlation: Answer

[1] -0.8475514

Correlation: Rear Axle Ratio vs Quarter Mile Time

Correlation: Answer

[1] 0.09120476
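
The three answers above appear to match R's built-in `mtcars` data set. Assuming that is the data behind these slides (the slides do not say so), a minimal sketch to reproduce them with `cor()`:

```r
# Assumption: the slides use R's built-in mtcars data set.
data(mtcars)

cor_disp_hp   <- cor(mtcars$disp, mtcars$hp)    # displacement vs horsepower
cor_disp_mpg  <- cor(mtcars$disp, mtcars$mpg)   # displacement vs miles per gallon
cor_drat_qsec <- cor(mtcars$drat, mtcars$qsec)  # rear axle ratio vs quarter mile time

print(c(cor_disp_hp, cor_disp_mpg, cor_drat_qsec))
```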

Correlation Example: nonlinear

Correlation Example: nonlinear, answer

[1] 1.239961e-16
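
The data behind the nonlinear example is not shown, so the following is only a sketch with made-up data. It illustrates how a variable can be completely determined by another and still have a correlation of (numerically) zero, because correlation only measures linear association.

```r
# Hypothetical data: a perfect quadratic relationship, symmetric around x = 0.
x <- seq(-10, 10, by = 1)
y <- x^2

# y is fully determined by x, yet the linear correlation is essentially zero.
r <- cor(x, y)
print(r)
```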

OLS Regression

  • Our goal: to estimate the causal effect of some treatment
    • e.g. how does welfare reform impact poverty rates?
    • Does getting the flu vaccine improve your health?
    • Does increasing teacher salary improve student outcomes?
    • Do cats respond to cat music?
      • This is a real academic paper.

OLS Regression: Causal Relationships

  • Even simple causal models can have very complicated formulas (think of physics calculations)
  • We need to simplify this model to get anywhere. We assume a linear relationship
  • e.g. student grades increase (or decrease) linearly with teacher salary
    • Going from $20,000/year to $21,000/year has the same effect on grades as going from $150,000/year to $151,000/year
    • This assumption can be relaxed later by introducing model transformations

A Roadmap of What’s to Come

  • Start with the mechanics of OLS, and hand-wave most of the tricky parts regarding causality
  • Once we get to multivariate OLS we can try to address issues of bias
  • When we get to Potential Outcomes we can focus on intricacies surrounding causal effects
  • At the end we conclude with basic research designs

OLS Regression: “The Core Model”

  • We arrive at what the book calls the core model, though you will not see this terminology elsewhere
  • It is a generic formula that applies to any relationship between x and y:
  • \(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)

OLS Regression: “The Core Model”

  • \(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)
    • \(y_i\) is our outcome (dependent) variable for individual i. Here it’s student i’s grade
    • \(\beta_0\) is the intercept. If we take our model seriously it’s the average grade for a student whose teacher has a salary of 0
    • \(x_i\) is the independent variable for individual i. Here it is student i’s teacher’s salary
    • \(\beta_1\) is the slope. This is the actual (average) causal effect of increasing x by 1 on y

OLS Regression: “The Core Model”

  • \(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)
  • \(\varepsilon_i\) is the error term. It captures every factor not included in our model (which is a lot of things!)
    • Examples: student intelligence. Other student characteristics (e.g. demographics). Whether the student is interested in a subject. Whether the student slept in for an exam.
    • It has mean value of 0

OLS Regression: Estimating the Core Model

  • We don’t know the true values of \(\beta_0, \beta_1\), so we need to estimate
  • We end up estimating \(y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat\varepsilon_i\)
  • Hats are used to indicate estimates. We know the actual value of x and y, but not anything else

OLS Questions

  • \(grade_i = \beta_0+\beta_1 salary_i + \varepsilon_i\)
    • \(salary_i\) is salary (in thousands of dollars), \(grade_i\) is final grade in percent (e.g. 100)
  • Suppose \(\beta_0=60, \beta_1=1\). How do we interpret this?
  • Student 1 has a teacher who is paid $20,000. Calculate \(\widehat{grade}_1\)
  • Student 1’s actual grade in class was a 65. What is \(\varepsilon_1\)?

OLS Questions

  • \(grade_i = \beta_0+\beta_1 salary_i + \varepsilon_i\)
    • \(salary_i\) is salary (in thousands of dollars), \(grade_i\) is final grade in percent (e.g. 100)
  • What is included in \(\varepsilon\)?
  • Suppose we estimate \(\hat\beta_0=40,\hat\beta_1=2\). What are \(\widehat{grade}_1\) and \(\hat\varepsilon_1\)?
  • What is included in \(\hat\varepsilon\) that is not included in \(\varepsilon\)?
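
The arithmetic in the two question slides can be checked with a few lines of R. This is only a sketch using the numbers given on the slides; the variable names are made up for illustration. Note that salary is measured in thousands of dollars, so $20,000 enters the model as 20.

```r
# Values taken from the two question slides above.
salary_1 <- 20      # $20,000, measured in thousands of dollars
grade_1  <- 65      # student 1's actual grade

# True parameters (first slide): the model's grade and the true error term.
beta_0 <- 60
beta_1 <- 1
grade_1_model <- beta_0 + beta_1 * salary_1      # 60 + 1 * 20
epsilon_1     <- grade_1 - grade_1_model

# Estimated parameters (second slide): the predicted grade and the residual.
beta_0_hat <- 40
beta_1_hat <- 2
grade_1_hat   <- beta_0_hat + beta_1_hat * salary_1   # 40 + 2 * 20
epsilon_1_hat <- grade_1 - grade_1_hat

print(c(grade_1_model, epsilon_1, grade_1_hat, epsilon_1_hat))
```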

OLS Regression: Some Observations

  • Core model: \(y_i=\beta_0+\beta_1 x_i +\varepsilon_i\)
  • Key question 1: If the model is this simplified, is it even useful?
  • From a predictive analytics perspective this is very weak. But as long as some simple assumptions are satisfied (covered later) this efficiently measures an average causal effect.
  • This is important for policy evaluation. If a union negotiates a salary increase what will happen to the average grade? This is the causal effect.

OLS Regression: Some Observations

  • Key Question 2: Given our model, how do we calculate \(\beta_1\) (and \(\beta_0\))?
  • We can never observe \(\beta_1\), but we can estimate it as \(\hat\beta_1\) using ordinary least squares regression
  • \(\hat\beta_1\) is a sample statistic (we’ll calculate later)
  • We then have to ask if \(\hat\beta_1\) is close to \(\beta_1\)

OLS regression: ideas

  • Our model is a line, and we have data. We estimate \(\beta_0,\beta_1\) by finding the best fit line to the observed data
  • We can measure the fit using sum of squared errors or mean squared error
  • \(SSE=\sum \varepsilon_i^2=\sum (y_i-\beta_0-\beta_1 x_i)^2\), \(MSE=SSE/n\)
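
As a sketch of how these fit measures are computed for a candidate line, the following uses simulated data (the data, seed, and function names are assumptions made for illustration, not the data from the lecture):

```r
# Hypothetical data, generated only for illustration.
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50)

# Sum of squared errors and mean squared error for a candidate line (b0, b1).
sse <- function(b0, b1, x, y) sum((y - b0 - b1 * x)^2)
mse <- function(b0, b1, x, y) sse(b0, b1, x, y) / length(y)

sse_good <- sse(3, 1.1, x, y)   # the line that generated the data
sse_bad  <- sse(0, 2.0, x, y)   # a poorly chosen line
print(c(sse_good, sse_bad, mse(3, 1.1, x, y)))
```

A smaller SSE (or MSE) means the candidate line sits closer to the observed points.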

OLS Regression: Graph

OLS Regression: Fitting

  • Here we have a scatterplot of data, and the line \(y=3+1.1x\). For each point we can calculate the error term, square it, then take the average to get the mean squared error.
  • How do we know what the best fit line is?
  • Naive: for every possible \(\beta_0,\beta_1\) compute the MSE (or SSE), then choose the parameters that give the lowest value (best fit)
    • This is actually how many learning models work, but they use methods from calculus to make it fast
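
The naive approach can be sketched as a grid search. The data, grid ranges, and step sizes below are arbitrary choices for illustration; the result is compared against `lm()`, which solves the same problem exactly.

```r
# Hypothetical data, generated only for illustration.
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50)

# Try every (b0, b1) pair on a coarse grid and record the SSE of each line.
grid <- expand.grid(b0 = seq(0, 6, by = 0.05), b1 = seq(0, 2, by = 0.01))
grid$sse <- mapply(function(b0, b1) sum((y - b0 - b1 * x)^2), grid$b0, grid$b1)

# The best-fit line on the grid is the pair with the smallest SSE.
best <- grid[which.min(grid$sse), ]
fit  <- coef(lm(y ~ x))

print(best)
print(fit)
```

The grid answer can only be as precise as the grid spacing, which is one reason to prefer a closed-form or calculus-based solution.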

OLS Regression: Fitting

  • Why use mean square error and not mean error?
    • Note that for any \(\beta_1\), we can calculate \(\beta_0\) using simple algebra to make the mean error 0, so the mean error cannot tell a good line from a bad one
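
A small sketch of this point, using simulated data (an assumption for illustration): for any slope, setting the intercept to \(\bar y - \beta_1 \bar x\) drives the mean error to zero, while the mean squared error still distinguishes between slopes.

```r
# Hypothetical data, generated only for illustration.
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50)

# For a given slope b1, pick the intercept that makes the mean error zero.
errors_for_slope <- function(b1) {
  b0 <- mean(y) - b1 * mean(x)
  y - b0 - b1 * x
}

slopes      <- c(-5, 0, 1.1, 10)
mean_errors <- sapply(slopes, function(b1) mean(errors_for_slope(b1)))
mean_sq_err <- sapply(slopes, function(b1) mean(errors_for_slope(b1)^2))

print(mean_errors)   # all (numerically) zero, whatever the slope
print(mean_sq_err)   # very different across slopes
```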

Question: Why SSE?

OLS Motivating Example: Student Scores

OLS Motivating Example: Student Scores


Call:
lm(formula = grade ~ attendance, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-63.514  -8.129   0.613   9.083  40.188 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)   54.011      3.703  14.588  < 2e-16 ***
attendance    38.164      4.377   8.719 9.41e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 15.85 on 206 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2695,    Adjusted R-squared:  0.266 
F-statistic: 76.02 on 1 and 206 DF,  p-value: 9.408e-16

OLS Regression: How to Estimate

  • Core model: \(y_i=\beta_0+\beta_1 x_i +\varepsilon_i\)
  • Given our model, how do we calculate \(\beta_1\) (and \(\beta_0\))?
  • We can never observe \(\beta_1\), but we can estimate it as \(\hat\beta_1\) using ordinary least squares regression
  • We then have to ask if \(\hat\beta_1\) is close to \(\beta_1\)

OLS Regression: How to Estimate

  • Our model is a line, and we have data. We estimate \(\beta_0,\beta_1\) by finding the best fit line to the observed data
  • We can measure the fit using sum of squared errors or mean squared error
  • \(SSE=\sum \varepsilon_i^2=\sum (y_i-\beta_0-\beta_1 x_i)^2\), \(MSE=SSE/n\)

OLS Regression Estimates: Graphical

SSE: 165.8958 

OLS Regression Estimates: Graphical

  • What about now?
SSE: 164.6405 

OLS Regression Estimates: Graphical

We need to shift our estimate up so that it’s halfway between the data (\(\bar e=0\)):

mean(output[[2]]$e)
[1] 3.184145

OLS Regression Estimates: Graphical

  • Now?
SSE: 12.55881 

OLS Regression Estimates: Graphical

  • Now?!
SSE: 11.86442 

OLS Formula

  • Our OLS model is simple enough that we have a simple formula for \(\hat\beta_1\)
  • \(\hat\beta_1=\sum(x_i-\bar x)(y_i-\bar y)/(\sum(x_i-\bar x)^2)\)
  • \(=E[(x-\bar x)(y-\bar y)]/var(x)\)
  • \(=cov(x,y)/var(x)\)
  • \(=\rho \sigma_y/\sigma_x\)
  • Note: \(\beta_0\) obtained through substitution: \(\hat\beta_0=\bar y -\hat\beta_1 \bar x\)
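
The equivalent forms of the formula can be checked numerically. The sketch below uses simulated data (an assumption for illustration) and compares each form against the coefficients reported by `lm()`.

```r
# Hypothetical data, generated only for illustration.
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50)

# Three equivalent ways to write the slope estimate.
b1_sums <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
b1_cov  <- cov(x, y) / var(x)
b1_cor  <- cor(x, y) * sd(y) / sd(x)

# Intercept by substitution.
b0 <- mean(y) - b1_sums * mean(x)

fit <- coef(lm(y ~ x))
print(c(b0 = b0, b1_sums = b1_sums, b1_cov = b1_cov, b1_cor = b1_cor))
print(fit)
```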

OLS Formula

  • Key question: is \(\hat\beta_1\approx\beta_1\)? i.e. as \(n\to\infty\) does \(\hat\beta_1\to\beta_1\)?
  • Seems unlikely. This is just correlation, yet we’re trying to measure a causal parameter. Correlation \(\neq\) Causation…

Properties of \(\hat\beta\)

  • Two important questions have not been addressed
  • What is the sampling distribution of \(\hat\beta\)? i.e. what are its variance and its distribution?
  • Under what conditions does \(\hat\beta\to\beta\)? Or more generally, when is \(\hat\beta\) unbiased: \(E[\hat\beta]=\beta\)?

Unbiasedness

  • \(\beta_1\) is a causal parameter, but \(\hat\beta_1\) is calculated using correlation. Under what conditions do these two equal each other on average, i.e. \(E[\hat\beta_1]=\beta_1\)?
  • First, let’s simulate a counterexample. We have 1000 students who each have a randomly generated “ability” score (mean 0, sd 1)
  • They then have an attendance rate that is partially random and partially determined by ability.
  • Finally we generate the score as \(10*ability + 5*attendance + \varepsilon\) where \(\varepsilon\) is a random variable with mean 0 and sd 1

Unbiasedness

  • Here \(\beta_1=5\): this is our causal parameter. Our \(10*ability\) term is not specified in our simple regression so it is included in the error term of our model
  • What will the result be?

Unbiasedness

Unbiasedness


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
     -12.91        30.88  
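
A simulation along the lines described above could look like the following sketch. The slides do not show how attendance is generated, so the baseline of 0.5, the 0.15 weight on ability, and the noise level are assumptions; the numbers it produces will therefore differ from the output shown, but the slope lands far above the true value of 5 for the same reason.

```r
set.seed(42)
n <- 1000

ability <- rnorm(n)   # mean 0, sd 1, as described in the slides

# Assumption: attendance = baseline + ability component + independent noise.
attendance <- 0.5 + 0.15 * ability + rnorm(n, sd = 0.15)

# Score as described in the slides; the true causal effect of attendance is 5.
score <- 10 * ability + 5 * attendance + rnorm(n)

dt <- data.frame(ability, attendance, score)
fit_biased <- lm(score ~ attendance, data = dt)
print(coef(fit_biased))
```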

Unbiasedness

  • \(\beta_1=5\), but \(\hat\beta_1=30.88\)!
  • Why?

Endogeneity Explained

  • Our problem is that ability is correlated with both score and attendance
cor(dt$ability,dt$attendance)
[1] 0.6945407
cor(dt$ability,dt$score)
[1] 0.9924159
  • Since we don’t observe ability in our simple model, it appears in our error term. Our general condition is exogeneity: we need our error term to be uncorrelated with \(x\)

Endogeneity Explained

  • More generally, \(E[\hat\beta_1]=\beta_1 + cor(x,\varepsilon)*\sigma_\varepsilon/\sigma_x\)
    • Our estimate equals the true parameter plus some bias term
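
In a simulation we know the true error term, so the sample version of this decomposition can be checked directly. The sketch below reuses an assumed data-generating process (the attendance equation is not from the slides) and compares the OLS slope with \(\beta_1\) plus the bias term computed from sample moments.

```r
set.seed(42)
n <- 1000

ability    <- rnorm(n)
attendance <- 0.5 + 0.15 * ability + rnorm(n, sd = 0.15)   # assumed process
noise      <- rnorm(n)
score      <- 10 * ability + 5 * attendance + noise

# The error term of the simple model: everything except beta_1 * attendance.
epsilon <- 10 * ability + noise

# Bias term from the formula above, using sample moments.
bias <- cor(attendance, epsilon) * sd(epsilon) / sd(attendance)

beta_1_hat <- unname(coef(lm(score ~ attendance))["attendance"])
print(c(estimate = beta_1_hat, truth_plus_bias = 5 + bias))
```

In real data \(\varepsilon\) is unobserved, so this check is only possible in a simulation.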

A second example

  • Suppose we have the same setup, but now attendance is independent of ability. Will \(\hat\beta_1=5\)?

A second example

A second example


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
    -0.9932       7.2068  
  • Kinda! It’s close but not exact. What if we used more than 1000 students?

A second example

A second example


Call:
lm(formula = score ~ attendance, data = dt)

Coefficients:
(Intercept)   attendance  
     0.0127       4.9690  
  • Yes!
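
A sketch of this second example, again with an assumed attendance distribution (mean 0.5, sd 0.2, drawn independently of ability). The function name `simulate_slope` is made up for illustration.

```r
set.seed(42)

# Simulate n students with attendance independent of ability, return the OLS slope.
simulate_slope <- function(n) {
  ability    <- rnorm(n)
  attendance <- 0.5 + rnorm(n, sd = 0.2)   # assumption: unrelated to ability
  score      <- 10 * ability + 5 * attendance + rnorm(n)
  unname(coef(lm(score ~ attendance))["attendance"])
}

slope_small <- simulate_slope(1000)   # noisy, but centered on 5
slope_large <- simulate_slope(1e6)    # much closer to 5

print(c(n_1000 = slope_small, n_1e6 = slope_large))
```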

Endogeneity: arrows of causality

  • We can interpret these graphically. If X and Y are related we can draw an arrow between them. In our core model X causes Y, and our error term causes Y as well.

Endogeneity: arrows of causality

Endogeneity: arrows of causality

A second example

  • In the first I increase attendance by 1, and grade increases by \(\beta_1\). Grade is also affected by \(e\), but when I increase attendance \(e\) does not change, on average
  • In the second I increase attendance by 1, and grade directly increases by \(\beta_1\), but ability also changes with attendance (based on \(cor(attendance,ability)\)), and grade changes again through ability (based on \(cor(ability,grade)\)). In general, as long as x and e are correlated we will have bias

Example: Simpson’s paradox

Exogeneity: when do we have it?

  • Under what conditions do we have \(cor(x,\varepsilon)=0\)? [called exogenous variation]
  • Randomized control trials. If we randomly assign x then any other characteristics should average out across treatment and control
  • Quasi-Experiments: Differences-in-Differences, Instrumental Variables, and Regression Discontinuity most common

Quasi-Experiment: DID

  • DiD: How do we measure the effect of immigration on natives’ wages?
    • Cuba sends a large wave of migrants to the US. Due to proximity, most settle in Miami. You measure the change in employment and compare it to a very similar nearby region that did not have this random wave of immigration.

Quasi-Experiment: IV

  • IV: How do we measure the effect of having a child on future wages?
    • IUDs are over 99% effective in preventing pregnancy. Use individuals whose birth control failed to estimate a causal effect of having a child on future wages

Quasi-Experiment: RD

  • RD: How do we determine the effect of passing a class on completing college?
    • An individual who received a 69.9% in a course and one who received a 70% are virtually identical, yet their outcomes are very different (fail vs pass). Use this to determine the causal effect of passing a course
  • These will be covered in more detail later in the course.
  • Another option is to use a multivariate regression with “control variables”, which is next chapter

Distribution of \(\hat\beta\)

  • As we saw before, our estimate \(\hat\beta_1\) was unbiased in our second example, but even with 1000 students it could be off by a substantial margin
  • In addition to the mean of \(\hat\beta_1\) (\(E[\hat\beta_1]\), which when unbiased is equal to \(\beta_1\)), we want to know the standard deviation
    • To avoid confusion with the standard deviation of other terms, we call this the standard error

Distribution of \(\hat\beta\)

  • In its expanded form \(\hat\beta_1=\sum(x_i-\bar x)(y_i-\bar y)/(\sum(x_i-\bar x)^2)\)
    • This doesn’t need to be memorized, but our estimate is just an average of \(x_i, y_i\), and \(x_i^2\) values
    • If our estimate is unbiased we have asymptotic normality from the central limit theorem
  • How do we find the standard error of \(\hat\beta_1\)?

Standard Error of \(\hat\beta_1\)

  • Directly: \(var(\hat\beta_1)=(\hat\beta_{11}-\beta_1)^2/n + (\hat\beta_{12}-\beta_1)^2/n + ... + (\hat\beta_{1n}-\beta_1)^2/n\)
  • Substitute \(\hat\beta_{1i}=\sum(x_i-\bar x)(y_i-\bar y)/(\sum(x_i-\bar x)^2)\)
  • You now have a giant formula, simplify! (you’re mostly separating out terms and recombining into known forms)
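
The "direct" approach can be approximated by simulation: draw many samples, estimate the slope in each, and look at how the estimates spread around the true value. The sketch below assumes the same hypothetical data-generating process as the independent-attendance example (attendance sd of 0.2 is an assumption) and compares the simulated spread with the value implied by the variance formula.

```r
set.seed(42)
n    <- 1000   # students per sample
reps <- 2000   # number of repeated samples

# One sample, one slope estimate (cov/var is the OLS slope formula).
one_estimate <- function() {
  ability    <- rnorm(n)
  attendance <- 0.5 + rnorm(n, sd = 0.2)   # assumption: independent of ability
  score      <- 10 * ability + 5 * attendance + rnorm(n)
  cov(attendance, score) / var(attendance)
}

estimates <- replicate(reps, one_estimate())

mean_est  <- mean(estimates)                   # close to 5: unbiased
se_sim    <- sqrt(mean((estimates - 5)^2))     # spread around the true value
se_theory <- sqrt(101) / (sqrt(n) * 0.2)       # sd of error / (sqrt(n) * sd of x)

print(c(mean = mean_est, se_simulated = se_sim, se_theory = se_theory))
```

Here the error term is \(10*ability\) plus noise, so its variance is \(100 + 1 = 101\). A histogram of `estimates` (for example via `hist(estimates)`) should also look roughly bell-shaped.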

Standard Error of \(\hat\beta_1\)

  • Result: \(var(\hat\beta_1)=\hat\sigma^2/(nVar(x))\)
  • \(\hat\sigma^2\) is the variance of the regression, i.e. the variance around the line of fit
  • Recall \(y_i=\beta_0+\beta_1 x_i +\varepsilon_i\)
  • The variation around this line is \(var(\varepsilon_i)\) (\(=var(y_i-\hat y_i)\))
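
The formula can be compared with the standard error that `lm()` reports. This is a sketch on simulated data (an assumption for illustration). Two details matter for an exact match: \(\hat\sigma^2\) divides the sum of squared residuals by \(n-2\), and \(Var(x)\) here divides by \(n\) rather than \(n-1\), so that \(nVar(x)=\sum(x_i-\bar x)^2\).

```r
# Hypothetical data, generated only for illustration.
set.seed(42)
n <- 1000
x <- 0.5 + rnorm(n, sd = 0.2)
y <- 5 * x + rnorm(n, sd = 10)

fit <- lm(y ~ x)

sigma2_hat <- sum(resid(fit)^2) / (n - 2)   # variance around the fitted line
var_x      <- mean((x - mean(x))^2)         # variance of x, dividing by n

se_formula <- sqrt(sigma2_hat / (n * var_x))
se_lm      <- summary(fit)$coefficients["x", "Std. Error"]

print(c(formula = se_formula, lm = se_lm))
```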

Standard Error of \(\hat\beta_1\)

  • Naturally we can also get standard errors directly from R (the standard deviation from before with n=1000 is 1.49, or 30% of the mean)
summary(lm(data=dt,score ~ attendance))

Call:
lm(formula = score ~ attendance, data = dt)

Residuals:
    Min      1Q  Median      3Q     Max 
-45.743  -6.771  -0.023   6.781  53.516 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.01270    0.02610   0.487    0.627    
attendance   4.96901    0.04818 103.128   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 10.04 on 999998 degrees of freedom
Multiple R-squared:  0.01052,   Adjusted R-squared:  0.01052 
F-statistic: 1.064e+04 on 1 and 999998 DF,  p-value: < 2.2e-16

Standard Error Implications

  • The more random variation we have in our model (from \(\varepsilon\)), the noisier our prediction will be
  • The more observations we have the more precise our estimate
  • The more variation we have in our explanatory variable (independent variable) the more precise our estimate

A note on consistency

  • Consistency means that as we increase our number of observations, the thing we’re measuring (\(\hat\beta_1\)) will approach some value. OLS is a consistent estimator: we will eventually converge on some value
  • If \(\hat\beta_1\to\beta_1\) as \(n\to\infty\) we say that our estimator is asymptotically unbiased
  • Many common estimators are biased but asymptotically unbiased. Example: \(\hat\sigma^2\)

What assumptions are we making?

  • 1: Unbiasedness: \(cov(x,\varepsilon)=0\)

What assumptions are we making?

  • 2: When is the variance formula \(var(\hat\beta_1)=\hat\sigma^2/(nVar(x))\) valid (the standard error is its square root)?
  • Homoskedasticity: all errors have the same variance. \(var(\varepsilon_1)=var(\varepsilon_2)=...=var(\varepsilon_n)\)
  • Uncorrelated errors: \(cor(\varepsilon_i,\varepsilon_j)=0\) for all \(i\neq j\)
    • Autocorrelation in time series. Your income in 2024 is probably close to your income in 2023 (unless you’re graduating!)
    • Cluster correlation: all students in a specific classroom are subject to construction sound during an exam

What assumptions are we making?

  • Given unbiasedness, how do we know this is the best way to estimate \(\beta\)?
  • Gauss-Markov Theorem: OLS is the minimum variance unbiased linear estimator. Required conditions are exactly the ones above
  • What if we remove the linear requirement? Then we need only add that error terms are normally distributed.

A note on Expectation

  • \(E[X]\) refers to \(\bar x\) or \(\mu_x\). If we obtain \(\hat\beta=1\) it follows that \(E[\hat\beta]=1\). Similarly \(E[1]=1\)

Homoskedasticity

  • One common form of heteroskedasticity occurs when errors are proportional to the size of the observation:

Homoskedasticity

  • For simple cases like this we can solve the issue by using log(y) instead of y, but we often have more complicated error structures
  • We can correct for this (asymptotically) using robust standard errors. The formula is horrifying, but it’s easy to implement in software
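
To make the idea concrete, here is a sketch of one common robust variance estimator (the basic "sandwich" form, often labelled HC0) computed by hand in base R. The slides do not say which variant is intended, so treat this choice, along with the simulated data, as an assumption. In practice a package such as `sandwich` is typically used instead of writing this out.

```r
# Hypothetical data with error sd proportional to x (heteroskedastic).
set.seed(42)
n <- 2000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = 0.5 * x)

fit <- lm(y ~ x)
X   <- model.matrix(fit)   # n x 2 design matrix: intercept and x
e   <- resid(fit)

# Sandwich estimator: (X'X)^-1  X' diag(e^2) X  (X'X)^-1
bread    <- solve(crossprod(X))
meat     <- crossprod(X * e)     # each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

se_robust    <- sqrt(diag(vcov_hc0))
se_classical <- sqrt(diag(vcov(fit)))
print(rbind(classical = se_classical, robust = se_robust))
```

The coefficient estimates are unchanged; only the standard errors differ.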

Autocorrelation

  • Our data is autocorrelated if we can predict an observation using the previous one, i.e. \(cor(y_i,y_{i-1})\neq0\)
  • Time series data is often autocorrelated: if you have a stable job we can do a good job of predicting your income next year by using this year’s income as a base.
  • This is much more problematic, but sometimes we can solve it by differencing our model. Instead of using \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\) use \(\Delta y_i=\beta_0 + \beta_1 \Delta x_i + \varepsilon_i\)

Autocorrelation

  • We can assess autocorrelation using the autocorrelation function: acf in R. It gives \(cor(y_i,y_{i-j})\) for different lags j
  • Autocorrelation does not affect bias, only the precision of our estimates
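
A sketch of `acf` on a simulated series. The random walk below is a made-up example of strongly autocorrelated data, and it looks at a single series rather than a full regression; it shows that differencing can remove most of the autocorrelation.

```r
# Hypothetical series: a random walk, where each value builds on the last.
set.seed(42)
n <- 500
y <- cumsum(rnorm(n))

acf_levels <- acf(y, lag.max = 5, plot = FALSE)         # original series
acf_diffs  <- acf(diff(y), lag.max = 5, plot = FALSE)   # differenced series

# Element 1 of $acf is lag 0 (always 1), element 2 is lag 1.
lag1_levels <- acf_levels$acf[2]
lag1_diffs  <- acf_diffs$acf[2]

print(c(levels = lag1_levels, differences = lag1_diffs))
```

Calling `acf(y)` without `plot = FALSE` draws the usual autocorrelation plot.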

Clustering of errors

  • Our data can also be correlated if they’re subject to common shocks. E.g. all students in one classroom have construction outside while taking an exam
  • Errors can be clustered in software to account for this. Again, the formula is fairly complex but simple to implement.

Assessing Goodness of Fit

  • Whether a model is a good fit, or practically significant, is highly context dependent. There are some measures that may be helpful
  • \(\hat\sigma\), the standard error of the regression or root mean square error (rMSE) gives the variation around the line of best fit
  • Models can be compared to each other to assess predictive power, but if you just tell me the rMSE of a model I really can’t say anything about whether it’s a good fit

Assessing Goodness of Fit

  • \(R^2\) (literally \(r^2\) where r is our coefficient of correlation) gives the percent of variation in y explained by x.
  • \(R^2\) is unitless and between 0 and 1
  • Hard to compare across contexts. \(R^2=0.3\) might be low if comparing GDP across countries, while \(R^2=0.01\) might be high if using individual data for insurance claims
  • In general a plot of the data is always a good idea
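
Both fit measures can be pulled from a fitted model. The sketch below uses simulated data (an assumption for illustration) and checks that, in a regression with one explanatory variable, \(R^2\) equals the squared correlation between x and y.

```r
# Hypothetical data, generated only for illustration.
set.seed(1)
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50, sd = 3)

fit <- lm(y ~ x)

r_squared <- summary(fit)$r.squared   # share of variation in y explained by x
r         <- cor(x, y)                # coefficient of correlation
rmse      <- summary(fit)$sigma       # residual standard error (sigma hat)

# R^2 can also be written as 1 - SSE / total variation in y.
r_squared_manual <- 1 - sum(resid(fit)^2) / sum((y - mean(y))^2)

print(c(r_squared = r_squared, r_sq = r^2, manual = r_squared_manual, rmse = rmse))
```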

Highly influential observations

  • OLS regression can be highly sensitive to outliers, particularly if both the x and y values are extreme
  • The results are still unbiased with outliers: we often do nothing about them
  • You can assess how influential a point is by removing it from the data and seeing how \(\hat\beta_1\) changes. The more data you have the less likely outliers are to be a major issue
    • We will explore this more in problem set 1
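
The leave-one-out check described above can be sketched as follows. The data is made up: twenty points close to a line with slope 2, plus one added point that is extreme in both x and y.

```r
# Hypothetical data: 20 well-behaved points plus one extreme point.
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
x_out <- c(x, 40)   # extreme x value ...
y_out <- c(y, 0)    # ... with a y value far from the line

slope_with    <- unname(coef(lm(y_out ~ x_out))[2])
slope_without <- unname(coef(lm(y ~ x))[2])

# Drop each observation in turn and record the slope without it.
loo_slopes <- sapply(seq_along(x_out), function(i) {
  coef(lm(y_out[-i] ~ x_out[-i]))[2]
})
change <- abs(loo_slopes - slope_with)

print(c(with_outlier = slope_with, without_outlier = slope_without))
print(unname(which.max(change)))   # index of the most influential observation
```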