If we want an unbiased model, which would you choose?
Dor: remember, a short discussion on the trade-off of adding more variables (bad controls, variance).
Q3
Bias of the slope
The true model is the long model \(prod = \beta_0 + \beta_1 training + \beta_2 ability + u\)
Assumption: \(cor(training,ability)<0\)
What happens if we estimate the short model \(prod = \alpha_0 + \alpha_1 training + v\)?
Can you guess the sign of the bias? What if:
ability is positively correlated with productivity, but
ability is negatively correlated with training?
Then \(\hat{\alpha}_1\) is downward-biased w.r.t. \(\beta_1\).
Let's show this explicitly.
From our assumption: \[cor(training,ability)<0\rightarrow cov(training,ability)<0\rightarrow\frac{cov(training,ability)}{var(training)}<0\]
Hence, when estimating \(ability = \gamma_0 + \gamma_1 training + e\), the estimator \(\hat{\gamma}_1 < 0\).
From the OLS formula for the short model, \[\hat{\alpha}_1 = \frac{cov(training,prod)}{var(training)}\]
From the OLS formula for the long model, \[\hat{\beta}_1 = \frac{cov(training,prod)}{var(training)} - \hat{\beta}_2\frac{cov(training,ability)}{var(training)}\]
Combining, we get \[\hat{\beta}_1 = \hat{\alpha}_1 - \hat{\beta}_2\hat{\gamma}_1\leftrightarrow \hat{\alpha}_1 = \hat{\beta}_1 + \hat{\beta}_2\hat{\gamma}_1\]
To sign the bias, all we need are the signs of \(\hat{\beta}_2\) and \(\hat{\gamma}_1\). Since we argued that \(\hat{\gamma}_1<0\), and that \(\hat{\beta}_2>0\) (ability raises productivity), we get \(\hat{\alpha}_1 = \hat{\beta}_1 + \text{something negative}\). We say that this estimator is hence downward-biased; a simulation sketch follows.
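As a sanity check, here is a minimal simulation sketch in R; the data-generating process and every parameter value are invented for illustration.

```r
# Hypothetical DGP for illustration: beta_2 = 3 > 0, cor(training, ability) < 0
set.seed(1)
n <- 10000
ability  <- rnorm(n)
training <- -0.5 * ability + rnorm(n)                  # negatively correlated with ability
prod     <- 1 + 2 * training + 3 * ability + rnorm(n)  # the long model is the truth

long  <- coef(lm(prod ~ training + ability))  # beta-hats
short <- coef(lm(prod ~ training))            # alpha-hats (ability omitted)
gamma <- coef(lm(ability ~ training))         # gamma-hats

short["training"]  # well below the true beta_1 = 2: downward bias
# The identity alpha_1-hat = beta_1-hat + beta_2-hat * gamma_1-hat holds exactly in-sample:
unname(short["training"] - long["training"])
unname(long["ability"] * gamma["training"])
```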
Bias of the intercept
Let's look at \(\hat{\alpha}_0\) directly, by plugging in the formulas.
From the OLS intercept formulas: \[\hat{\alpha}_0 = \overline{prod} - \hat{\alpha}_1 \overline{training},\qquad \hat{\beta}_0 = \overline{prod} - \hat{\beta}_1 \overline{training} - \hat{\beta}_2\overline{ability},\qquad \hat{\gamma}_0 = \overline{ability} - \hat{\gamma}_1 \overline{training}\]
Substitute the long-model expression for \(\overline{prod}\) into the short-model intercept: \[\hat{\alpha}_0 = \hat{\beta}_0 + \hat{\beta}_1\overline{training} + \hat{\beta}_2\overline{ability} - \hat{\alpha}_1 \overline{training}\]
Substitute the OVB formula for the slope we found above \[\begin{align*}
\hat{\alpha}_{0} & =\hat{\beta}_{0}+\hat{\beta}_{1}\overline{training}+\hat{\beta}_{2}\overline{ability}-\hat{\alpha}_{1}\overline{training}\\
& =\hat{\beta}_{0}+\hat{\beta}_{2}\overline{ability}-\left(\hat{\alpha}_{1}-\hat{\beta}_{1}\right)\overline{training}\\
& =\hat{\beta}_{0}+\hat{\beta}_{2}\overline{ability}-\left(\hat{\beta}_{2}\hat{\gamma}_{1}\right)\overline{training}\\
& =\hat{\beta}_{0}+\hat{\beta}_{2}\left(\overline{ability}-\hat{\gamma}_{1}\overline{training}\right)
\end{align*}\]
Finally, substitute the intercept estimate from the covariate regression \[\begin{align*}
\hat{\alpha}_{0} & =\hat{\beta}_{0}+\hat{\beta}_{2}\left(\overline{ability}-\hat{\gamma}_{1}\overline{training}\right)\\
& =\hat{\beta}_{0}+\hat{\beta}_{2}\hat{\gamma}_{0}
\end{align*}\]
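On the same invented simulation data from the sketch above, the intercept identity \(\hat{\alpha}_0 = \hat{\beta}_0 + \hat{\beta}_2\hat{\gamma}_0\) can be checked numerically; it holds exactly in-sample.

```r
# Continuing the simulation above: the intercepts line up exactly
unname(short["(Intercept)"])                                          # alpha_0-hat
unname(long["(Intercept)"] + long["ability"] * gamma["(Intercept)"])  # beta_0-hat + beta_2-hat * gamma_0-hat
```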
Q4
```r
# load data
df = read.csv(glue::glue("{insert_data_path}/wage1.csv"))
# correlation between variable and its square
cor(df$exper, df$expersq)
```
[1] 0.9609709
Regression between the two explanatory variables
What do we expect the sign of the slope to be in a regression of one of these variables on the other?
If the correlation is positive,
then the covariance is positive,
and so the OLS slope estimator (covariance over variance) will be positive.
Regression when omitting one of these variables
What will happen if we regress log(wage) on experience, without experience squared? Let's reason it out before computing:
We saw that experience squared is positively correlated with experience,
and we can guess that experience squared is negatively correlated with earnings,
so we expect a downward bias.
Indeed, after controlling for expersq, the estimator on exper increases; i.e., it was biased downwards before we controlled for its square. A short R check of both claims follows.
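A minimal check of both claims in R, assuming the wage1 data frame `df` loaded above and its `exper`, `expersq`, and `lwage` columns as used in the problem set:

```r
# Slope of a regression of experience on its square: positive, as argued
coef(lm(exper ~ expersq, data = df))

# Short model: log(wage) on experience only
coef(lm(lwage ~ exper, data = df))
# Long model: the coefficient on exper rises once expersq is controlled for,
# i.e., the short-model estimate was downward-biased
coef(lm(lwage ~ exper + expersq, data = df))
```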
Q5
Consider two regressions of wages on tenure:
once with a first-order polynomial: \(wage = \beta_0 + \beta_1 Tenure + \epsilon\),
and once with a second-order polynomial: \(wage = \alpha_0 + \alpha_1 Tenure + \alpha_2 Tenure^2 + u\).
Proposition A
Proposition. If the model with a first-order polynomial is true, then \(E[\hat{\alpha}_2]=0\).
Proof (True).
If the first-order model is true,
it must be that the population parameter \(\alpha_2=0\),
and that \(E[\epsilon \mid Tenure] = 0\).
Now suppose we estimate the second model. Denote \(e = \alpha_2 Tenure^2 + u\), so that we can re-write it as \[wage = \alpha_0 + \alpha_1 Tenure + \underset{e}{\underbrace{\alpha_2 Tenure^2 + u}} = \alpha_0 + \alpha_1 Tenure + e\]
With \(e\) as the error term, it holds that \[E[\epsilon\mid Tenure]=0\leftrightarrow E[e\mid Tenure]=0\leftrightarrow E[\alpha_2 Tenure^2 + u\mid Tenure]=0\]
From \(\alpha_2=0\) it follows that \[ E[\alpha_2 Tenure^2 + u \mid Tenure]=E[u\mid Tenure]=0\].
Hence,
The OLS estimators in the second-order model are unbiased, and
in particular this holds for the estimator of the squared term, i.e. \(E[\hat{\alpha}_2]=\alpha_2=0\) (a simulation check follows).
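A quick simulation sketch of this result; the linear data-generating process and its parameter values are invented for illustration.

```r
# Hypothetical DGP in which the first-order model is true (alpha_2 = 0)
set.seed(2)
alpha2_hat <- replicate(2000, {
  tenure <- runif(500, 0, 20)
  wage   <- 5 + 0.3 * tenure + rnorm(500)   # linear truth
  coef(lm(wage ~ tenure + I(tenure^2)))[3]  # estimated coefficient on the squared term
})
mean(alpha2_hat)  # close to 0, consistent with E[alpha_2-hat] = alpha_2 = 0
```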
Proposition B
Proposition. If the model with a second-order polynomial is true, \(\alpha_1>0\), and \(\alpha_2<0\), then \(E[\hat{\beta}_1]\neq\alpha_1\), but we cannot sign the bias of \(\hat{\beta}_1\) w.r.t. \(\alpha_1\).
Proof (Not true).
Tenure squared is positively correlated with tenure, and (since \(\alpha_2<0\)) it enters wages negatively, so we can sign the bias as negative, i.e., a downward bias. We did this a lot above, so we continue without showing the equations; a simulation sketch follows instead.
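A simulation sketch of the signed bias, again with an invented data-generating process (here \(\alpha_1 = 1\) and \(\alpha_2 = -0.02\)):

```r
# Hypothetical concave DGP: alpha_1 = 1 > 0, alpha_2 = -0.02 < 0
set.seed(3)
tenure <- runif(5000, 0, 20)
wage   <- 5 + tenure - 0.02 * tenure^2 + rnorm(5000)

coef(lm(wage ~ tenure))["tenure"]                # well below 1: downward bias
coef(lm(wage ~ tenure + I(tenure^2)))["tenure"]  # close to the true alpha_1 = 1
```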
Proposition C
Proposition. The second-order model with all variables in logs is estimable.
Proof (Not true).
If all variables are in logs, then since \(log(Tenure^2)=2log(Tenure)\), the model exhibits perfect multicollinearity, and so we can't estimate it. In R, the regression would simply omit one of the variables, as the check below shows.
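A minimal demonstration of how R handles the perfectly collinear log term, on simulated data with strictly positive tenure so that the logs exist:

```r
# log(Tenure^2) = 2 * log(Tenure): perfectly collinear regressors
set.seed(4)
tenure   <- runif(100, 1, 20)  # strictly positive so the logs exist
log_wage <- 1 + 0.5 * log(tenure) + rnorm(100)

coef(lm(log_wage ~ log(tenure) + log(tenure^2)))  # R reports NA for the redundant term
```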
Q6
Outline of article
Notation: Subjective happiness := \(SH\), some individual := \(i\)
Model: \(SH_i = \alpha + \beta income_i + u_i\)
Finding: \(\hat{\beta}>0\)
OVB
What's missing?
Family background, education, surroundings, within-family correlations, health, …
What's the sign of the bias? Remember, it needs to be signed w.r.t. some omitted variable (OV).
E.g., let's stick with education.
Let's assume educated persons have higher income,
but, since ignorance is bliss, are subjectively less happy.
What's the sign of the bias?
From the above: negative (i.e., a downward bias), as formalized just below.
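To formalize this with the OVB decomposition from Q3 (the symbols are introduced here for illustration: \(\hat{\beta}_{educ}\) for the estimated effect of education on \(SH\), and \(\hat{\gamma}_1\) for the slope from a regression of education on income):

```latex
% OVB sign for the education example (short model omits educ):
\begin{align*}
\hat{\beta}^{short}
  &= \hat{\beta}^{long}
   + \underbrace{\hat{\beta}_{educ}}_{<0\ \text{(ignorance is bliss)}}
     \cdot
     \underbrace{\hat{\gamma}_{1}}_{>0\ \text{(educ rises with income)}} \\
  &= \hat{\beta}^{long} + \text{something negative}
   \quad\Rightarrow\quad \text{downward bias}
\end{align*}
```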
Controlling
Is it easy to control for all of these, and other relevant, variables?
No!
So even if the authors had controlled for the OVs, could we give the OLS estimator a causal interpretation?
No, but probably better than without controls (up for debate).
Discussion: so where have we gotten so far?
An experiment
Discussion: so how can we causally estimate the effect of income on happiness?
A thought experiment:
Let's say we take random people and divide them into two groups,
ask them how happy they are,
then give one of the groups money,
and ask again how happy they are afterwards.
Is this good enough?
What could go wrong?
Lab vs. natural setting (external validity?)
Reports vs. actual happiness (measurement error?)
Is there a possible natural experiment, where we can observe actual happiness?
Discuss questions that are answerable vs. not.