Undergrad Metrics - PS6

Author

Dor Leventer

Published

December 1, 2022

Q1

Say we estimate \[ price_i = 10 + 1.5\times bedroom_i + 2\times bathroom_i + 12\times area_i \]

  • What happens if a landlord divides one bedroom into two bedrooms? What's constant? What's different?
beta_bedroom = 1.5
beta_bathroom = 2
beta_area = 12
alpha = 10

# all else constant, add one bedroom
# change equal to:
beta_bedroom
[1] 1.5
# all else constant, add one bedroom and add 3 to area
# change equal to:
beta_bedroom + 3*beta_area
[1] 37.5
  • Dor - remember, discuss ‘average’ change, and not ‘individual’ change.

  • Is this causal?

Q2

Load data, create variables

# load packages
library(tidyverse)

# load data
path = "/Users/dorleventer/Dropbox/teaching/undergrad_econometrics_spring_2023"
insert_data_path <- file.path(path, "data")
df = read.csv(glue::glue("{insert_data_path}/wage2.csv"))
# create log wage and potential experience (pexp) variables
df = df %>%
  mutate(
    log_wage = log(wage),
    pexp = age - educ - 6,
    pexp2 = pexp^2
  ) %>%
  as_tibble()

df
# A tibble: 935 × 19
     obs   age black brthord  educ feduc hours    iq   kww lwage married meduc
   <int> <int> <int>   <int> <int> <int> <int> <int> <int> <dbl>   <int> <int>
 1     1    31     0       2    12     8    40    93    35  6.65       1     8
 2     2    37     0      NA    18    14    50   119    41  6.69       1    14
 3     3    33     0       2    14    14    40   108    46  6.72       1    14
 4     4    32     0       3    12    12    40    96    32  6.48       1    12
 5     5    34     0       6    11    11    40    74    27  6.33       1     6
 6     6    35     1       2    16    NA    40   116    43  7.24       1     8
 7     7    30     0       2    10     8    40    91    24  6.40       0     8
 8     8    38     0       3    18    NA    40   114    50  6.99       1     8
 9     9    36     0       3    15     5    45   111    37  7.05       1    14
10    10    36     0       1    12    11    40    95    44  6.91       1    12
# … with 925 more rows, and 7 more variables: sibs <int>, south <int>,
#   urban <int>, wage <int>, log_wage <dbl>, pexp <dbl>, pexp2 <dbl>

Estimate a model with multicollinearity

Note that potential experience is fully determined by age and educ (pexp = age - educ - 6), so educ, pexp, and age are perfectly collinear.

summary(lm(log_wage ~ educ + pexp + pexp2 + age, data=df))

Call:
lm(formula = log_wage ~ educ + pexp + pexp2 + age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.86059 -0.22867  0.03587  0.26465  1.37655 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  5.0529170  0.2133277  23.686 < 0.0000000000000002 ***
educ         0.0838286  0.0072517  11.560 < 0.0000000000000002 ***
pexp         0.0671036  0.0239406   2.803              0.00517 ** 
pexp2       -0.0015824  0.0008356  -1.894              0.05856 .  
age                 NA         NA      NA                   NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3939 on 931 degrees of freedom
Multiple R-squared:  0.1282,    Adjusted R-squared:  0.1254 
F-statistic: 45.64 on 3 and 931 DF,  p-value: < 0.00000000000000022

age was omitted!

  • Why? Because of perfect collinearity with pexp and educ (see the alias() check below).

  • How did R choose to omit age? It drops the last collinear variable in the formula; if we change the order:

summary(lm(log_wage ~ educ + age + pexp + pexp2, data=df))

Call:
lm(formula = log_wage ~ educ + age + pexp + pexp2, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.86059 -0.22867  0.03587  0.26465  1.37655 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  4.6502953  0.3429839  13.558 < 0.0000000000000002 ***
educ         0.0167250  0.0237091   0.705              0.48072    
age          0.0671036  0.0239406   2.803              0.00517 ** 
pexp                NA         NA      NA                   NA    
pexp2       -0.0015824  0.0008356  -1.894              0.05856 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3939 on 931 degrees of freedom
Multiple R-squared:  0.1282,    Adjusted R-squared:  0.1254 
F-statistic: 45.64 on 3 and 931 DF,  p-value: < 0.00000000000000022
  • With the new order, the last collinear variable is omitted (this time, pexp).

  • What can we do? Choose which variable to omit, or define pexp differently.
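One way to see the linear dependency directly is base R's alias(), which reports which regressors are exact linear combinations of the others. A minimal sketch (the comment describes the expected dependency):

# report exact linear dependencies among the regressors
# since pexp = age - educ - 6, age is a linear combination of the intercept, educ, and pexp
alias(lm(log_wage ~ educ + pexp + pexp2 + age, data = df))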

Adding explanatory (RHS) variables

Basic model

We begin by omitting age.

summary(lm(log_wage ~ educ + pexp + pexp2, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  5.052917009 0.2133277308
educ         0.083828588 0.0072516770
pexp         0.067103612 0.0239406453
pexp2       -0.001582433 0.0008355912

Omitted variable bias (OVB)

We continue by adding IQ.

summary(lm(log_wage ~ educ + pexp + pexp2 + iq, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  4.697127993 0.2167882552
educ         0.063366078 0.0078313824
pexp         0.067876010 0.0234694198
pexp2       -0.001570575 0.0008191349
iq           0.006106719 0.0009805248
  • We added IQ, and the effect of education on wages decreased.

  • How could we have seen that coming?

    • If we note two things:

    • That the correlation between IQ and educ is positive.

cor(df$iq, df$educ)
[1] 0.515697
  • That the correlation between IQ and log_wage is positive.
cor(df$iq, df$lwage)
[1] 0.3147877
  • Then the estimator on educ (\(\hat{\beta}_{educ}\)) is biased upward when we estimate a model without IQ.

It is constructive to think of the two-variable case:

  • Long model:
\[\begin{align} Y &= \alpha + \beta_1 educ + \beta_2 IQ + u \\ \hat{\beta}_1 &= \frac{\hat{cov}(educ,Y)}{\hat{var}(educ)} - \hat{\beta}_2\frac{\hat{cov}(educ,IQ)}{\hat{var}(educ)} \end{align}\]
  • Short model (leaving aside the hats…):
\[\begin{align} Y &= \tilde{\alpha} + \tilde{\beta}_1educ + v \\ \hat{\tilde{\beta}}_1 &= \frac{cov(educ,Y)}{var(educ)} \end{align}\]
  • Covariates model:
\[\begin{align} IQ &= \gamma_0 + \gamma_1 educ + e \\ \hat{\gamma}_1 &= \frac{cov(educ,IQ)}{var(educ)} \end{align}\]

Putting it all together:

\[\begin{align} \hat{\beta}_1 &= \frac{\hat{cov}(educ,y)}{\hat{var}(educ)} - \hat{\beta}_2\frac{\hat{cov}(educ,IQ)}{\hat{var}(educ)} \\ \leftrightarrow \hat{\beta}_1 &= \hat{\tilde{\beta}}_1 - \hat{\beta}_2\hat{\gamma}_1 \\ \leftrightarrow (\text{Short Model Estimator:})\quad\hat{\tilde{\beta}}_1 &= \hat{\beta}_1 + \hat{\beta}_2\hat{\gamma}_1 \\ &= Truth + Bias \end{align}\]

Translating what we wrote above to math,

  • we argued that \(\hat{\gamma}_1>0\) and \(\hat{\beta}_2>0\),

  • and so the estimator in the short model is upward biased.

Bottom line: adding IQ to the estimation will decrease the coefficient on educ.
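As a sanity check, we can verify the decomposition \(\hat{\tilde{\beta}}_1 = \hat{\beta}_1 + \hat{\beta}_2\hat{\gamma}_1\) directly in the data. This is a minimal sketch of the two-variable case (ignoring pexp and pexp2); the OVB algebra is an exact in-sample identity, so the two numbers should agree up to rounding:

# short model: log wage on educ only
b1_short = coef(lm(log_wage ~ educ, data = df))["educ"]
# long model: log wage on educ and iq
long = coef(lm(log_wage ~ educ + iq, data = df))
# covariates model: iq on educ
g1 = coef(lm(iq ~ educ, data = df))["educ"]
# short-model slope vs. long-model slope plus the bias term
c(b1_short, long["educ"] + long["iq"] * g1)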

More controls

We continue by adding father's education.

summary(lm(log_wage ~ educ + pexp + pexp2 + iq + feduc, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  4.769617360 0.2352397408
educ         0.054716879 0.0088027999
pexp         0.058818673 0.0255789941
pexp2       -0.001155708 0.0009027647
iq           0.005783902 0.0011411345
feduc        0.013049787 0.0047694996
  • Compare to previous estimators. What changed?

  • What does this mean for feduc as an omitted variable (OV)?

Now add mother's education.

summary(lm(log_wage ~ educ + pexp + pexp2 + iq + feduc + meduc, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  4.718291292 0.2420661714
educ         0.054332388 0.0089634235
pexp         0.064914445 0.0265187182
pexp2       -0.001421293 0.0009426224
iq           0.005469130 0.0011657645
feduc        0.009249090 0.0055402097
meduc        0.008844003 0.0063002757
  • What happened to the slope of educ?

  • What happened to the statistical significance of feduc?

  • If we want an unbiased model, which specification would you choose?

  • Dor: remember, a short discussion on the trade-off of adding more variables (bad controls, variance).

Q3

Bias of the slope

  • True model is long model \(prod = \beta_0 + \beta_1training + \beta_2 ability + u\)

  • Assumption: \(cor(training,ability)<0\)

  • What happens if we estimate the short model \(prod = \alpha_0 + \alpha_1training + v\)?

Can you guess the sign of the bias? What if:

  1. ability positively correlated with productivity, but

  2. ability is negatively correlated with training?

Then \(\hat{\alpha}_1\) is downward-biased, w.r.t \(\beta_1\).

Let's show this explicitly.

  1. From our assumption: \[cor(training,ability)<0\rightarrow cov(training,ability)<0\rightarrow\frac{cov(training,ability)}{var(training)}<0\]

  2. Hence, when estimating \(ability = \gamma_0 + \gamma_1 training + e\), the estimator \(\hat{\gamma}_1 < 0\).

  3. From the OLS formula for the short model, \[\hat{\alpha}_1 = \frac{cov(training,prod)}{var(training)}\]

  4. From the OLS formula for the long model, \[\hat{\beta}_1 = \frac{cov(training,prod)}{var(training)} - \hat{\beta}_2\frac{cov(training,ability)}{var(training)}\]

  5. Combining, we get \[\hat{\beta}_1 = \hat{\alpha}_1 - \hat{\beta}_2\hat{\gamma}_1\leftrightarrow \hat{\alpha}_1 = \hat{\beta}_1 + \hat{\beta}_2\hat{\gamma}_1\]

  6. To sign the bias, all we need are the signs of \(\hat{\beta}_2\) and \(\hat{\gamma}_1\). Since we argued that \(\hat{\gamma}_1<0\) and \(\hat{\beta}_2>0\), we have \(\hat{\alpha}_1 = \hat{\beta}_1 + \text{something negative}\). Hence this estimator is downward-biased.

Bias of the intercept

Let's look at \(\hat{\alpha}_0\) directly, by plugging in the formulas.

  1. OLS formula \[\hat{\alpha}_0 = \overline{prod} - \hat{\alpha}_1 \overline{training}\] \[ \hat{\beta}_0 = \overline{prod} - \hat{\beta}_1 \overline{training} - \hat{\beta}_2\overline{ability}\] \[\hat{\gamma}_0 = \overline{ability} - \hat{\gamma}_1 \overline{training}\]

  2. Substitute the long-model expression for \(\overline{prod}\) into the short-model intercept formula:

\[\hat{\alpha}_0 = \hat{\beta}_0 + \hat{\beta}_1\overline{training} + \hat{\beta}_2\overline{ability} - \hat{\alpha}_1 \overline{training}\]

  3. Substitute the OVB formula for the slope that we found above \[\begin{align*} \hat{\alpha}_{0} & =\hat{\beta}_{0}+\hat{\beta}_{1}\overline{training}+\hat{\beta}_{2}\overline{ability}-\hat{\alpha}_{1}\overline{training}\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\overline{ability}-\left(\hat{\alpha}_{1}-\hat{\beta}_{1}\right)\overline{training}\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\overline{ability}-\left(\hat{\beta}_{2}\hat{\gamma}_{1}\right)\overline{training}\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\left(\overline{ability}-\hat{\gamma}_{1}\overline{training}\right) \end{align*}\]

  4. Substitute the intercept estimator from the covariate regression \[\begin{align*} \hat{\alpha}_{0} & =\hat{\beta}_{0}+\hat{\beta}_{2}\left(\overline{ability}-\hat{\gamma}_{1}\overline{training}\right)\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\hat{\gamma}_{0} \end{align*}\]
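A quick way to convince yourself of both decompositions (slope and intercept) is a small simulation. This is a minimal sketch; the data-generating process below (training negatively correlated with ability, both raising productivity) is made up purely for illustration:

set.seed(1)
n = 10000
ability  = rnorm(n)
training = -0.5 * ability + rnorm(n)        # cor(training, ability) < 0
prod     = 1 + 2 * training + 3 * ability + rnorm(n)

short = coef(lm(prod ~ training))            # alpha_0, alpha_1
long  = coef(lm(prod ~ training + ability))  # beta_0, beta_1, beta_2
aux   = coef(lm(ability ~ training))         # gamma_0, gamma_1

# slope: alpha_1 = beta_1 + beta_2 * gamma_1 (downward-biased since gamma_1 < 0)
c(short["training"], long["training"] + long["ability"] * aux["training"])
# intercept: alpha_0 = beta_0 + beta_2 * gamma_0
c(short["(Intercept)"], long["(Intercept)"] + long["ability"] * aux["(Intercept)"])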

Q4

# load data
df = read.csv(glue::glue("{insert_data_path}/wage1.csv"))

# correlation between variable and its square
cor(df$exper,df$expersq)
[1] 0.9609709

Regression between two explanatory variables

What do we expect the sign of the regression slope between the two to be?

  • If positive correlation,

  • then positive covariance,

  • and so sign of OLS estimator (covariance / variance) will be positive.

Check:

summary(lm(exper ~ expersq, data=df))$coefficients[,1]
(Intercept)     expersq 
 6.99388211  0.02117127 

Regression when omitting one of these variables

What will happen if we regress log(wages) on experience, without experience squared?

  • Let's try without computing!

  • We saw that experience squared is positively correlated with experience.

  • We can guess that experience squared is negatively correlated with earnings.

  • And so we get a downward bias here!

Check:

summary(lm(lwage ~ exper, data=df)) |> broom::tidy() |> select(term, estimate)
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)  1.55   
2 exper        0.00436
summary(lm(lwage ~ exper + expersq, data=df)) |> broom::tidy() |> select(term, estimate)
# A tibble: 3 × 2
  term         estimate
  <chr>           <dbl>
1 (Intercept)  1.30    
2 exper        0.0455  
3 expersq     -0.000944

And voilà! After controlling for expersq, the estimated coefficient on exper increased.

  • I.e., it was biased downward before controlling for its square.
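As a check, the OVB decomposition should reproduce the short-model slope exactly. This is a minimal sketch; here the auxiliary regression is of the omitted variable (expersq) on the included one (exper):

# short and long models
short = coef(lm(lwage ~ exper, data = df))
long  = coef(lm(lwage ~ exper + expersq, data = df))
# auxiliary regression: omitted variable on included variable
gamma = coef(lm(expersq ~ exper, data = df))

# short-model slope vs. long-model slope plus the bias term
c(short["exper"], long["exper"] + long["expersq"] * gamma["exper"])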

Q5

Consider two regressions of wages on tenure,

  • once with a first order polynomial: \(wage = \beta_0 + \beta_1 Tenure + \epsilon\)

  • and once with a second order polynomial: \(wage = \alpha_0 + \alpha_1 Tenure + \alpha_2 Tenure^2 + u\)

Proposition A

Proposition. If the model with a first order polynomial is true, then \(E[\hat{\alpha}_2]=0\).

Proof (True).

If the first-order model is true,

  • It must be that the population parameter \(\alpha_2=0\),

  • And that \(\mathbb{E}[\epsilon \mid Tenure] = 0\).

This implies that, if we estimate the second model, we get

  • Denote \(e = \alpha_2 Tenure^2 + u\). We can re-write the second model as
\[\begin{align} wage &= \alpha_0 + \alpha_1 Tenure + \underset{e}{\underbrace{\alpha_2 Tenure^2 + u}} \\ &= \alpha_0 + \alpha_1 Tenure + e \end{align}\]
  • In the model with \(\alpha_2 Tenure^2 + u=e\) in the error term, it holds that \[E[\epsilon\mid Tenure]=0\leftrightarrow E[e\mid Tenure]=0\leftrightarrow E[\alpha_2 Tenure^2 + u\mid Tenure]=0\]

  • From \(\alpha_2=0\) it follows that \[ E[\alpha_2 Tenure^2 + u \mid Tenure]=E[u\mid Tenure]=0\].

Hence,

  • The OLS estimators in the second order model are unbiased, and

  • In particular, this holds for the coefficient on the squared term, i.e. \(E[\hat{\alpha}_2]=\alpha_2=0\).
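A minimal Monte Carlo sketch of this claim (the data-generating process is made up for illustration): if the true model is first order, the estimated coefficient on the squared term should average out to roughly zero across simulated samples.

set.seed(3)
alpha2_hat = replicate(1000, {
  tenure = runif(100, min = 0, max = 20)
  wage   = 2 + 0.3 * tenure + rnorm(100)   # true model is first order
  coef(lm(wage ~ tenure + I(tenure^2)))["I(tenure^2)"]
})
mean(alpha2_hat)   # should be close to 0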

Proposition B

Proposition. If the model with a second order polynomial is true, with \(\alpha_1>0\) and \(\alpha_2<0\), then \(E[\hat{\beta}_1]\neq\alpha_1\), but we cannot sign the bias of \(\hat{\beta}_1\) w.r.t \(\alpha_1\).

Proof (Not true).

Tenure squared is positively correlated with tenure (\(\hat{\gamma}_1>0\)) and, by assumption, enters wages with a negative coefficient (\(\alpha_2<0\)), so we can sign the bias as negative. I.e., \(\hat{\beta}_1\) is downward-biased. We did this many times above, so we continue without showing the equations.

Proposition C

Proposition. The second order model with all variables in logs is estimable.

Proof (Not true).

If all variables are in logs, then since \(log(Tenure^2)=2log(Tenure)\), the model exhibits perfect multicollinearity, and so we cannot estimate it. In R, the regression would simply omit one of the variables.
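A minimal illustration with simulated data (made up; tenure is drawn strictly positive so the logs are defined) of R dropping the redundant regressor:

set.seed(2)
tenure = runif(200, min = 1, max = 20)
wage   = 5 + 0.5 * log(tenure) + rnorm(200)

# log(tenure^2) = 2 * log(tenure), so the second regressor is dropped (NA)
coef(lm(wage ~ log(tenure) + I(log(tenure^2))))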

Q6

Outline of article

Notation: Subjective happiness := \(SH\), some individual := \(i\)

Model: \(SH_i = \alpha + \beta income_i + u_i\)

Finding: \(\hat{\beta}>0\)

OVB

What's missing?

  • Family background, education, surroundings, within-family correlations, health, …

  • What's the sign of the bias? Remember, it needs to be defined w.r.t some OV.

  • E.g., let's stick with education.

    • Let's assume more educated people have higher income,

    • But, since ignorance is bliss, are less happy subjectively.

  • What's the sign of the bias?

  • From all of the above, negative (i.e., a downward bias).

Controlling

  • Is it easy to control for all of these, and other relevant, variables?

  • No!

  • So even if they had controlled for OV, can we give a causal interpretation to the OLS estimator?

  • No, but it is probably better than having no controls at all.

  • Discussion: so where have we got so far?

An experiment

  • Discussion: so how can we causally estimate the effect of income on happiness?

  • A thought experiment:

    • Let's say we take a random sample of people and divide them into two groups,

    • Ask people how happy they are,

    • And then give one of these groups money,

    • And ask how happy they are afterwards.

  • Is this good enough?

    • What could go wrong?

    • Lab vs. natural setting (external validity?)

    • Reports vs. actual happiness (measurement error?)

  • Is there a possible natural experiment, where we can observe actual happiness?

    • Discuss questions that are answerable vs. not.