Salary is in dollars. Grade is in GPA scale (1-4). Interpret \(\beta_1\)
A A 1 unit increase in GPA is associated with a \(\beta_1\) dollar increase in teacher salary, on average
B A 1 dollar increase in teacher salary is associated with a \(\beta_1\) unit increase in student GPA, on average
C A \(\beta_1\) unit increase in GPA is associated with a 1 dollar increase in teacher salary, on average
D A \(\beta_1\) dollar increase in teacher salary is associated with a 1 unit increase in student GPA, on average
OLS Regression: “The Core Model”
\(y_i=\beta_0 + \beta_1 x_i + \varepsilon_i\)
\(y_i\) is our outcome (dependent) variable for individual i. Here it’s student i’s grade
\(\beta_0\) is the intercept. If we take our model seriously it’s the average grade for a student whose teacher has 0 salary
\(x_i\) is the independent variable for individual i. Here it is student i’s teacher’s salary
\(\beta_1\) is the slope. This is the actual (average) causal effect of increasing x by 1 unit on y
\(\varepsilon_i\) is the error term. It captures every factor not included in our model (which is a lot of things!)
Examples: student intelligence. Other student characteristics (e.g. demographics). Whether the student is interested in a subject. Whether the student slept in for an exam.
It has a mean value of 0
Sources of variation
Suppose we talk about “the” causal effect of increasing teacher salary on grades. Consider the following ways that teacher salary can increase:
A more qualified teacher is hired
A bonus is paid based on performance
Teachers with low pay are laid off and class sizes increase
There is a shortage of teachers, driving up salaries
Salaries must be increased due to bad working conditions
Hours are increased, so salary is also increased
What we’re measuring
We’re measuring the average association between the two variables in the data
We’re getting a mix of all of the possible reasons why different teachers have different salaries (and students have different grades)
The weightings are based on the population and sample we use
This is called a reduced form estimate.
Just describing this data is often useful
OLS Regression: Estimating the Core Model
We don’t know the true values of \(\beta_0, \beta_1\), so we need to estimate
We end up estimating \(y_i = \hat\beta_0 + \hat\beta_1 x_i + \hat\varepsilon_i\)
Hats are used to indicate estimates. We know the actual value of x and y, but not anything else
Key question 1: If the model is this simplified, is it even useful?
From a predictive analytics perspective this is very weak. But as long as some simple assumptions are satisfied (covered later), it efficiently measures an average causal effect.
This is important for policy evaluation. If a union negotiates a salary increase what will happen to the average grade? This is the causal effect.
OLS Regression: Some Observations
Key Question 2: Given our model, how do we calculate \(\beta_1\) (and \(\beta_0\))?
We can never observe \(\beta_1\), but we can estimate it as \(\hat\beta_1\) using ordinary least squares regression
\(\hat\beta_1\) is a sample statistic (we’ll calculate later)
We then have to ask if \(\hat\beta_1\) is close to \(\beta_1\)
OLS regression: ideas
Our model is a line, and we have data. We estimate \(\beta_0,\beta_1\) by finding the best fit line to the observed data
We can measure the fit using sum of squared errors or mean squared error
Here we have a scatterplot of data, and the line \(y=3+1.1x\). For each point we can calculate the error term, square it, then take the average to get the mean square error.
How do we know what the best fit line is?
Naive: for every possible \(\beta_0,\beta_1\) compute the MSE (or SSE), then choose the parameters that give the lowest value (best fit)
This is actually how many machine learning models work, but they use methods from calculus to make it fast
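The naive search can be sketched in a few lines of R. The data here are simulated around the line \(y=3+1.1x\) purely for illustration (the sample size, noise level and grid ranges are assumptions, not values from the lecture data), and the result is compared with what lm() returns.

```r
# Simulated data (hypothetical values chosen only for illustration)
set.seed(42)
x <- runif(50, 0, 10)
y <- 3 + 1.1 * x + rnorm(50, sd = 2)

# Mean squared error of a candidate line y = b0 + b1 * x
mse <- function(b0, b1) mean((y - b0 - b1 * x)^2)

# Naive search: evaluate the MSE for every (b0, b1) pair on a coarse grid
grid <- expand.grid(b0 = seq(0, 6, by = 0.05), b1 = seq(0, 2, by = 0.01))
grid$mse <- mapply(mse, grid$b0, grid$b1)
best <- grid[which.min(grid$mse), ]

# Compare with the least squares fit from lm()
ols <- coef(lm(y ~ x))
print(best)
print(ols)
```

The grid answer can only be as fine as the grid spacing, so it should land near the lm() coefficients rather than exactly on them.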
OLS Regression: Fitting
Why use mean square error and not mean error?
Note that for any \(\beta_1\), we can calculate \(\beta_0\) using simple algebra to make the mean error 0
Question: Why SSE?
OLS Motivating Example: Student Scores
Call:
lm(formula = grade ~ attendance, data = dt)
Residuals:
    Min      1Q  Median      3Q     Max
-63.514  -8.129   0.613   9.083  40.188
Coefficients:
            Estimate Std. Error t value             Pr(>|t|)
(Intercept)   54.011      3.703  14.588 < 0.0000000000000002 ***
attendance    38.164      4.377   8.719 0.000000000000000941 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 15.85 on 206 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared:  0.2695, Adjusted R-squared:  0.266
F-statistic: 76.02 on 1 and 206 DF,  p-value: 0.0000000000000009408
Given our model, how do we calculate \(\beta_1\) (and \(\beta_0\))?
We can never observe \(\beta_1\), but we can estimate it as \(\hat\beta_1\) using ordinary least squares regression
We then have to ask if \(\hat\beta_1\) is close to \(\beta_1\)
OLS Regression: How to Estimate
Our model is a line, and we have data. We estimate \(\beta_0,\beta_1\) by finding the best fit line to the observed data
We can measure the fit using sum of squared errors or mean squared error: \(SSE=\sum \varepsilon_i^2=\sum (y_i-\beta_0-\beta_1 x_i)^2\), \(MSE=\frac{SSE}{n}\)
OLS Regression Estimates: Graphical
SSE: 195.1958
OLS Regression Estimates: Graphical
What about now?
SSE: 171.4438
OLS Regression Estimates: Graphical
We need to shift our estimate up so that it’s halfway between the data (\(\bar e=0\)):
Note: \(\hat\beta_0\) is obtained through substitution: \(\hat\beta_0=\bar y -\hat\beta_1 \bar x\)
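For reference, the substitution comes out of the two first-order conditions for minimizing SSE. This is the standard textbook derivation, sketched here with sums running over all observations i. The first condition says the residuals average to zero, which is the "halfway between the data" idea.

```latex
\[
\begin{aligned}
SSE(\hat\beta_0,\hat\beta_1) &= \sum_i (y_i-\hat\beta_0-\hat\beta_1 x_i)^2 \\
\frac{\partial SSE}{\partial \hat\beta_0} = -2\sum_i (y_i-\hat\beta_0-\hat\beta_1 x_i) = 0
  \;&\Rightarrow\; \hat\beta_0=\bar y-\hat\beta_1\bar x \\
\frac{\partial SSE}{\partial \hat\beta_1} = -2\sum_i x_i\,(y_i-\hat\beta_0-\hat\beta_1 x_i) = 0
  \;&\Rightarrow\; \hat\beta_1=\frac{\sum_i (x_i-\bar x)(y_i-\bar y)}{\sum_i (x_i-\bar x)^2}
\end{aligned}
\]
```

The last step substitutes \(\hat\beta_0=\bar y-\hat\beta_1\bar x\) into the second condition and uses \(\sum_i x_i(y_i-\bar y)=\sum_i (x_i-\bar x)(y_i-\bar y)\) and \(\sum_i x_i(x_i-\bar x)=\sum_i (x_i-\bar x)^2\).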
OLS Formula
Key question: is \(\hat\beta_1\approx\beta_1\)? I.e., as \(n\to\infty\) does \(\hat\beta_1\to\beta_1\)?
Seems unlikely. This is just correlation, yet we’re trying to measure a causal parameter. Correlation \(\neq\) Causation…
Properties of \(\hat\beta\)
Two important questions have not been addressed
What is the sampling distribution of \(\hat\beta\)? I.e. what are its variance and its distribution?
Under what conditions does \(\hat\beta\to\beta\)? Or more generally, when is \(\hat\beta\) unbiased: \(E[\hat\beta]=\beta\)?
Bias is a systematic property.
Unbiasedness
\(\beta_1\) is a causal parameter, but \(\hat\beta_1\) is calculated using correlation. Under what conditions do these two equal each other on average, i.e. \(E[\hat\beta_1]=\beta_1\)?
First, let’s simulate an example. We have 1000 students who each have a randomly generated “ability” score (mean 0, sd 1)
They then have an attendance rate that is partially random and partially determined by ability.
Generate the score as \(10*ability + 5*attendance + \varepsilon\), \(\varepsilon \sim N(0,1)\)
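A minimal R sketch of this simulation follows. The slides do not give the exact rule linking attendance to ability, so the 0.5 coefficient and the unit-variance noise in attendance are assumptions made only for illustration. A second version with attendance drawn independently of ability is included for comparison.

```r
set.seed(123)
n <- 1000
ability <- rnorm(n, mean = 0, sd = 1)

# Assumption: attendance = 0.5 * ability + noise (exact rule not given in the slides)
attendance <- 0.5 * ability + rnorm(n)
score <- 10 * ability + 5 * attendance + rnorm(n)

# Ability is left out of the regression, so it ends up in the error term
b_endog <- coef(lm(score ~ attendance))[["attendance"]]

# Comparison: attendance generated independently of ability
attendance_exog <- rnorm(n)
score_exog <- 10 * ability + 5 * attendance_exog + rnorm(n)
b_exog <- coef(lm(score_exog ~ attendance_exog))[["attendance_exog"]]

cat("true effect of attendance: 5\n")
cat("attendance depends on ability:    ", round(b_endog, 2), "\n")
cat("attendance independent of ability:", round(b_exog, 2), "\n")
```

Under these assumed numbers the first estimate should sit well above 5, because attendance picks up part of the effect of ability, while the second should be centred on 5 but noisy.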
We can interpret these graphically. If X and Y are related we can draw an arrow between them. In our core model X causes Y, and our error term causes Y as well.
Endogeneity: arrows of causality
A second example
In the first I increase attendance by 1, and grade increases by \(\beta_1\). Grade is also affected by \(e\), but when I increase attendance \(e\) does not change, on average
In the second I increase attendance by 1, and grade directly increases by \(\beta_1\), but ability also changes based on \(cor(attendance,ability)\), and grade changes further based on \(cor(ability,grade)\). In general, as long as x and e are correlated we will have bias
Example: Simpson’s paradox
Exogeneity: when do we have it?
Under what conditions do we have \(cor(x,\varepsilon)=0\)? [called exogenous variation]
Randomized control trials. If we randomly assign x then any other characteristics should average out across treatment and control groups
Quasi-Experiments: Differences-in-Differences, Instrumental Variables, and Regression Discontinuity are the most common
Quasi-Experiment: DID
DiD: How do we measure the effect of immigration on natives’ wages?
Cuba sends a large wave of migrants to the US. Due to proximity, most settle in Miami. You measure the change in employment and compare it to a very similar nearby region that did not have this random wave of immigration.
Quasi-Experiment: IV
IV: How do we measure the effect of having a child on future wages?
IUDs are over 99% effective in preventing pregnancy. Use individuals whose birth control failed to estimate a causal effect of having a child on future wages
Quasi-Experiment: RD
RD: How do we determine the effect of passing a class on completing college?
A student who received 69.9% in a course and one who received 70% are virtually identical, yet their outcomes are very different (fail vs pass). Use this to determine the causal effect of passing a course
These will be covered in more detail later in the course.
Another option is to use a multivariate regression with “control variables”, which is covered in the next chapter
Administrative Miscellanea
Homework 3 due Friday midnight
Quiz 2 today in class (closed note)
Regression focus
Exam 1 next Wednesday (math/stats review + bivariate OLS)
Class survey for extra credit (will send out an announcement once ready)
Bivariate OLS focusing on standard errors
iClicker
Which of the following statements are true concerning bias?
A \(\hat\beta_1\) is unbiased if \(\hat\beta_1=\beta_1\)
B \(\hat\beta_1\) is unbiased if \(E[\hat\beta_1]=\beta_1\)
C \(\hat\beta_1\) is biased if there is a large error term
D \(\hat\beta_1\) will be biased when outliers are present in the data
E OLS regression produces unbiased estimates of \(\hat\beta_1\)
Question
You are interested in how the credit rating of a firm (proxied by the interest paid on their bonds) affects the returns of their stock (in percentage points per year). What is your regression equation of interest?
iClicker
You are interested in how the credit rating of a firm (proxied by the interest paid on their bonds) affects the returns of their stock (in percentage points per year) by running \(return_i=\beta_0+\beta_1 credit_i + \varepsilon_i\). What would bias our estimate of \(\hat\beta_1\)?
Companies typically obtain good credit due to their superior operating and financial performance
Companies with bad credit pay more interest, reducing profits and increasing bankruptcy risk
On average, companies with bad credit operate in riskier industries which offer above average returns
iClicker
You are interested in how the credit rating of a firm (proxied by the interest paid on their bonds) affects the returns of their stock (in percentage points per year) by running \(return_i=\beta_0+\beta_1 credit_i + \varepsilon_i\). What would bias our estimate of \(\hat\beta_1\)?
Stock prices follow a random walk, meaning that stock returns are effectively randomly determined
Firms with bad credit receive less analyst coverage, decreasing average returns
Distribution of \(\hat\beta\)
As we saw before, our estimate of \(\hat\beta_1\) was unbiased in our second example, but even with 1000 students could be off by a substantial margin
In addition to the mean of \(\hat\beta_1\) (\(E[\hat\beta_1]\), which when unbiased is equal to \(\beta_1\)), we want to know the standard deviation
To avoid confusion with the standard deviation of other terms, we call this the standard error
Distribution of \(\hat\beta\)
In its expanded form \(\hat\beta_1=\frac{\sum(x_i-\bar x)(y_i-\bar y)}{\sum(x_i-\bar x)^2}\)
This doesn’t need to be memorized, but our estimate is built from averages of \(x_i\), \(y_i\), \(x_iy_i\), and \(x_i^2\) values
If our estimate is unbiased we have asymptotic normality from the central limit theorem
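To see the sampling distribution directly, one option is to redraw the simulated class many times and compute \(\hat\beta_1\) from the expanded formula each time. The sketch below assumes attendance is drawn independently of ability with mean 0.5 and sd 0.2; those two numbers and the 2000 replications are illustrative assumptions, not values stated in the slides.

```r
set.seed(2024)

# Slope estimate written out from the expanded formula
beta1_hat <- function(x, y) {
  sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
}

# One simulated class of n students; the true slope on attendance is 5
one_draw <- function(n = 1000) {
  ability <- rnorm(n)
  attendance <- rnorm(n, mean = 0.5, sd = 0.2)  # assumed, independent of ability
  score <- 10 * ability + 5 * attendance + rnorm(n)
  beta1_hat(attendance, score)
}

# Repeat the experiment many times and summarize the estimates
draws <- replicate(2000, one_draw())
print(c(mean = mean(draws), sd = sd(draws)))
```

Calling hist(draws) afterwards gives a visual check of the roughly bell-shaped distribution; the sd of the draws is the simulated counterpart of the standard error.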
How do we find the standard error of \(\hat\beta_1\)?
\(\sigma^2\) is the variance of the regression, i.e. the variance around the line of fit
Recall \(y_i=\beta_0+\beta_1 x_i +\varepsilon_i\)
The variation around this line is \(var(\varepsilon_i)\), which we estimate with \(var(y_i-\hat y_i)\)
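As a rough check, the standard error can be computed by hand and compared with what R reports. The sketch uses the usual homoskedastic formula \(se(\hat\beta_1)=\sqrt{\hat\sigma^2/\sum(x_i-\bar x)^2}\), with \(\hat\sigma^2\) computed using the \(n-2\) divisor that lm() uses; the simulated data and parameter values are assumptions for illustration only.

```r
set.seed(7)
n <- 500
x <- rnorm(n, mean = 0.5, sd = 0.2)   # hypothetical explanatory variable
y <- 5 * x + rnorm(n, sd = 10)        # hypothetical outcome with noisy errors

fit <- lm(y ~ x)

# Estimated variance around the fitted line (n - 2 degrees of freedom)
sigma2_hat <- sum(resid(fit)^2) / (n - 2)

# Standard error of the slope from the formula
se_by_hand <- sqrt(sigma2_hat / sum((x - mean(x))^2))

# Standard error as reported in the regression table
se_from_lm <- summary(fit)$coefficients["x", "Std. Error"]

print(c(by_hand = se_by_hand, from_lm = se_from_lm))
```

Note that \(\sum(x_i-\bar x)^2\) is \(n\) times the variance of x when that variance is computed with a \(1/n\) divisor.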
Standard Error of \(\hat\beta_1\)
Naturally we can also get standard errors directly from R (the standard deviation from before with n=1000 is 1.49, or 30% of the mean)
summary(lm(data=dt,score ~ attendance))
Call:
lm(formula = score ~ attendance, data = dt)
Residuals:
    Min      1Q  Median      3Q     Max
-48.691  -6.781   0.009   6.786  47.478
Coefficients:
            Estimate Std. Error t value            Pr(>|t|)
(Intercept) -0.01396    0.02613  -0.534               0.593
attendance   5.02854    0.04824 104.240 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 10.05 on 999998 degrees of freedom
Multiple R-squared:  0.01075, Adjusted R-squared:  0.01075
F-statistic: 1.087e+04 on 1 and 999998 DF,  p-value: < 0.00000000000000022
Standard Error Implications
The more random variation we have in our model (from \(\varepsilon\)), the noisier our estimate will be
The more observations we have the more precise our estimate
The more variation we have in our explanatory variable (independent variable) the more precise our estimate
A note on consistency
Consistency means that as we increase our number of observations, the thing we’re measuring (\(\hat\beta_1\)) will approach some value. OLS is a consistent estimator: we will eventually converge on some value
If \(\hat\beta_1\to\beta_1\) as \(n\to\infty\) we say that our estimator is asymptotically unbiased
Many common estimators are biased but asymptotically unbiased. Example: \(\hat\sigma^2\)
What assumptions are we making?
1: Unbiasedness: \(cov(x,\varepsilon)=0\)
What assumptions are we making?
2: Standard error: \(se(\hat\beta_1)=\sqrt{\frac{\hat\sigma^2}{nVar(x)}}\)
Homoskedasticity: all errors have the same variance. \(Var(\varepsilon_1)=Var(\varepsilon_2)=...=Var(\varepsilon_n)\)
Uncorrelated errors: \(cor(\varepsilon_i,\varepsilon_j)=0\) for all \(i\neq j\)
Autocorrelation in time series. Your income in 2024 is probably close to your income in 2023 (unless you’re graduating!)
Cluster correlation: all students in a specific classroom are subject to construction sound during an exam
What assumptions are we making?
Given unbiasedness, how do we know this is the best way to estimate \(\beta\)?
Gauss-Markov Theorem: OLS is the minimum variance unbiased linear estimator. Required conditions are exactly the ones above
What if we remove the linear requirement? Then we need only add that error terms are normally distributed.
A note on Expectation
\(E[X]\) refers to the population mean \(\mu_x\), which the sample mean \(\bar x\) estimates. If \(\beta=1\) and \(\hat\beta\) is unbiased, it follows that \(E[\hat\beta]=1\). Similarly \(E[1]=1\)
Homoskedasticity
One common form of heteroskedasticity occurs when errors are proportional to the size of the observation
Homoskedasticity
For simple cases like this we can solve the issue by using log(y) instead of y, but we often have more complicated error structures
We can correct for this (asymptotically) using robust standard errors. The formula is horrifying, but it’s easy to implement in software
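To give a feel for what the software does, here is a base-R sketch of the simplest robust ("sandwich") variance, often labelled HC0 or White. It is one of several variants, and in practice a package such as sandwich would normally be used rather than hand-rolled matrix code. The simulated data, with error spread growing in x, are an assumption for illustration.

```r
set.seed(11)
n <- 2000
x <- runif(n, 1, 10)
y <- 2 + 3 * x + rnorm(n, sd = x)   # error spread grows with x (heteroskedastic)

fit <- lm(y ~ x)
X <- model.matrix(fit)              # n x 2 design matrix: intercept and x
e <- resid(fit)

bread <- solve(crossprod(X))        # (X'X)^{-1}
meat <- crossprod(X * e)            # X' diag(e^2) X, each row of X scaled by its residual
vcov_hc0 <- bread %*% meat %*% bread

se_classic <- sqrt(diag(vcov(fit)))   # assumes homoskedasticity
se_robust <- sqrt(diag(vcov_hc0))     # allows each error its own variance
print(rbind(classic = se_classic, robust = se_robust))
```

The coefficient estimates are unchanged; only the standard errors (and so the t statistics and p-values) differ between the two rows.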
Autocorrelation
Our data is autocorrelated if we can predict an observation using the previous one, i.e. \(cor(y_i,y_{i-1})\neq0\)
Time series data is often autocorrelated: if you have a stable job we can do a good job of predicting your income next year by using this year’s income as a base.
This is much more problematic, but sometimes we can solve it by differencing our model. Instead of using \(y_i=\beta_0+\beta_1x_i+\varepsilon_i\) use \(\Delta y_i=\beta_0 + \beta_1 \Delta x_i + \varepsilon_i\)
Autocorrelation
We can assess autocorrelation using the autocorrelation function: acf in R. It gives \(cor(y_i,y_{i-k})\) for different lags \(k\)
Autocorrelation does not affect bias, only the precision of our estimates
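A small sketch of acf and differencing on a made-up series: the "income" path below is a simulated random walk with drift, chosen only because it is strongly autocorrelated, and is not data from the course.

```r
set.seed(5)
n <- 500
income <- cumsum(rnorm(n, mean = 1))   # hypothetical random walk with drift

# Autocorrelations at lags 0, 1, ..., 5; plot = FALSE returns the numbers only
acf_level <- acf(income, lag.max = 5, plot = FALSE)$acf[, 1, 1]

# Same calculation after differencing: change from one period to the next
acf_diff <- acf(diff(income), lag.max = 5, plot = FALSE)$acf[, 1, 1]

print(round(rbind(level = acf_level, differenced = acf_diff), 2))
```

The first column is lag 0, which is always 1. For a series like this the level should show autocorrelations close to 1, while the differenced series should show values near 0 beyond lag 0.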
Clustering of errors
Our data can also be correlated if they’re subject to common shocks. E.g. all students in one classroom have construction outside while taking an exam
Errors can be clustered in software to account for this. Again, the formula is fairly complex but simple to implement.
Assessing Goodness of Fit
Whether a model is a good fit, or practically significant, is highly context dependent. There are some measures that may be helpful
\(\hat\sigma\), the standard error of the regression or root mean square error (rMSE) gives the variation around the line of best fit
Models can be compared to each other to assess predictive power, but if you just tell me the rMSE of a model I really can’t say anything about whether it’s a good fit
Assessing Goodness of Fit
\(R^2\) (literally \(r^2\) where r is our coefficient of correlation) gives the percent of variation in y explained by x.
\(R^2\) is unitless and between 0 and 1
Hard to compare across contexts. \(R^2=0.3\) might be low if comparing GDP across countries, while \(R^2=0.01\) might be high if using individual data for insurance claims
In general a plot of the data is always a good idea
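Both measures can be read off a fitted model in R. The sketch below uses simulated data (the coefficients and noise level are assumptions) and checks that, in a bivariate regression, \(R^2\) equals the squared correlation between x and y.

```r
set.seed(3)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n, sd = 3)   # hypothetical data with a noisy relationship

fit <- lm(y ~ x)

r_squared <- summary(fit)$r.squared  # share of variation in y explained by x
r <- cor(x, y)                       # coefficient of correlation
rmse <- summary(fit)$sigma           # "Residual standard error" in the R output

print(c(R2 = r_squared, cor_squared = r^2, rMSE = rmse))
```

Following up with plot(x, y) and abline(fit) gives the picture that the numbers alone cannot.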
Highly influential observations
OLS regression can be highly sensitive to outliers, particularly if both the x and y values are extreme
The results are still unbiased with outliers: we often do nothing about them
You can assess how influential a point is by removing it from the data and seeing how \(\hat\beta_1\) changes. The more data you have the less likely outliers are to be a major issue
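The remove-and-refit check can be sketched as a short loop. The data are simulated, with one deliberately extreme point appended at the end (large x, y far from the line); all values are hypothetical.

```r
set.seed(9)
n <- 30
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Append one hypothetical extreme observation
x <- c(x, 6)
y <- c(y, -10)

# Slope using all observations
full_slope <- coef(lm(y ~ x))[["x"]]

# Slope after dropping each observation in turn
loo_slope <- sapply(seq_along(x), function(i) coef(lm(y[-i] ~ x[-i]))[[2]])

# How much the slope moves when each point is removed
change <- loo_slope - full_slope
most_influential <- which.max(abs(change))
print(most_influential)
print(round(change[most_influential], 2))
```

Here the appended point should be the one flagged. Base R also provides cooks.distance() for a related influence measure on a fitted lm object.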