Undergrad Metrics - PS6

Author

Dor Leventer

Published

December 1, 2022

Q1

Say we estimate \[ price_i = 10 + 1.5\times bedroom_i + 2\times bathroom_i + 12\times area_i \]

  • What happens if a landlord divides one bedroom into two bedrooms? What's constant? What's different?
beta_bedroom = 1.5
beta_bathroom = 2
beta_area = 12
alpha = 10

# all else constant, add one bedroom
# change equal to:
beta_bedroom
[1] 1.5
# all else constant, add one bedroom and add 3 to area
# change equal to:
beta_bedroom + 3*beta_area
[1] 37.5
  • Dor - remember, discuss ‘average’ change, and not ‘individual’ change.

  • Is this causal?

Q2

Load data, create variables

# load packages
library(tidyverse)

# load data
path = "/Users/dorleventer/Dropbox/teaching/undergrad_econometrics_spring_2023"
insert_data_path <- file.path(path, "data")
df = read.csv(glue::glue("{insert_data_path}/wage2.csv"))
# create log wage and potential experience (pexp) variables
df = df %>%
  mutate(
    log_wage = log(wage),
    pexp = age - educ - 6,
    pexp2 = pexp^2
  ) %>%
  as_tibble()

df
# A tibble: 935 × 19
     obs   age black brthord  educ feduc hours    iq   kww lwage married meduc
   <int> <int> <int>   <int> <int> <int> <int> <int> <int> <dbl>   <int> <int>
 1     1    31     0       2    12     8    40    93    35  6.65       1     8
 2     2    37     0      NA    18    14    50   119    41  6.69       1    14
 3     3    33     0       2    14    14    40   108    46  6.72       1    14
 4     4    32     0       3    12    12    40    96    32  6.48       1    12
 5     5    34     0       6    11    11    40    74    27  6.33       1     6
 6     6    35     1       2    16    NA    40   116    43  7.24       1     8
 7     7    30     0       2    10     8    40    91    24  6.40       0     8
 8     8    38     0       3    18    NA    40   114    50  6.99       1     8
 9     9    36     0       3    15     5    45   111    37  7.05       1    14
10    10    36     0       1    12    11    40    95    44  6.91       1    12
# … with 925 more rows, and 7 more variables: sibs <int>, south <int>,
#   urban <int>, wage <int>, log_wage <dbl>, pexp <dbl>, pexp2 <dbl>

Estimate a model with multicollinearity

Note that potential experience is fully determined by age and educ (pexp = age - educ - 6), so educ, pexp, and age are perfectly collinear.

summary(lm(log_wage ~ educ + pexp + pexp2 + age, data=df))

Call:
lm(formula = log_wage ~ educ + pexp + pexp2 + age, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.86059 -0.22867  0.03587  0.26465  1.37655 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  5.0529170  0.2133277  23.686 < 0.0000000000000002 ***
educ         0.0838286  0.0072517  11.560 < 0.0000000000000002 ***
pexp         0.0671036  0.0239406   2.803              0.00517 ** 
pexp2       -0.0015824  0.0008356  -1.894              0.05856 .  
age                 NA         NA      NA                   NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3939 on 931 degrees of freedom
Multiple R-squared:  0.1282,    Adjusted R-squared:  0.1254 
F-statistic: 45.64 on 3 and 931 DF,  p-value: < 0.00000000000000022

age was omitted!

  • Why? Because of perfect collinearity with pexp and educ (see the alias() check below).

  • How did R choose to omit age? It drops the last collinear variable in the formula; if we change the order:

summary(lm(log_wage ~ educ + age + pexp + pexp2, data=df))

Call:
lm(formula = log_wage ~ educ + age + pexp + pexp2, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.86059 -0.22867  0.03587  0.26465  1.37655 

Coefficients: (1 not defined because of singularities)
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  4.6502953  0.3429839  13.558 < 0.0000000000000002 ***
educ         0.0167250  0.0237091   0.705              0.48072    
age          0.0671036  0.0239406   2.803              0.00517 ** 
pexp                NA         NA      NA                   NA    
pexp2       -0.0015824  0.0008356  -1.894              0.05856 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3939 on 931 degrees of freedom
Multiple R-squared:  0.1282,    Adjusted R-squared:  0.1254 
F-statistic: 45.64 on 3 and 931 DF,  p-value: < 0.00000000000000022
  • With the new order, the last collinear variable is omitted (this time, pexp).

  • What can we do? Choose which variable to omit, or define pexp differently.
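One way to see the linear dependency directly is base R's alias(), which reports which regressors are exact linear combinations of the others. A minimal sketch (the comment describes the expected dependency):

# report exact linear dependencies among the regressors
# since pexp = age - educ - 6, age is a linear combination of the intercept, educ, and pexp
alias(lm(log_wage ~ educ + pexp + pexp2 + age, data = df))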

Adding explanatory (RHS) variables

Basic model

We begin by omitting age.

summary(lm(log_wage ~ educ + pexp + pexp2, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  5.052917009 0.2133277308
educ         0.083828588 0.0072516770
pexp         0.067103612 0.0239406453
pexp2       -0.001582433 0.0008355912

Omitted variable bias (OVB)

We continue by adding IQ.

summary(lm(log_wage ~ educ + pexp + pexp2 + iq, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  4.697127993 0.2167882552
educ         0.063366078 0.0078313824
pexp         0.067876010 0.0234694198
pexp2       -0.001570575 0.0008191349
iq           0.006106719 0.0009805248
  • We added IQ, and the effect of education on wages decreased.

  • How could we have seen that coming?

    • If we note two things:

    • That the correlation between IQ and educ is positive.

cor(df$iq, df$educ)
[1] 0.515697
  • That the correlation between IQ and log_wage is positive.
cor(df$iq, df$lwage)
[1] 0.3147877
  • Then the estimator on educ (\(\hat{\beta}_{educ}\)) is biased upward when we estimate a model without IQ.

It is constructive to think of the two-variable case:

  • Long model:
\[\begin{align} Y &= \alpha + \beta_1 educ + \beta_2 IQ + u \\ \hat{\beta}_1 &= \frac{\hat{cov}(educ,Y)}{\hat{var}(educ)} - \hat{\beta}_2\frac{\hat{cov}(educ,IQ)}{\hat{var}(educ)} \end{align}\]
  • Short model (leaving aside the hats…):
\[\begin{align} Y &= \tilde{\alpha} + \tilde{\beta}_1educ + v \\ \hat{\tilde{\beta}}_1 &= \frac{cov(educ,Y)}{var(educ)} \end{align}\]
  • Covariates model:
\[\begin{align} IQ &= \gamma_0 + \gamma_1 educ + e \\ \hat{\gamma}_1 &= \frac{cov(educ,IQ)}{var(educ)} \end{align}\]

Putting it all together:

\[\begin{align} \hat{\beta}_1 &= \frac{\hat{cov}(educ,y)}{\hat{var}(educ)} - \hat{\beta}_2\frac{\hat{cov}(educ,IQ)}{\hat{var}(educ)} \\ \leftrightarrow \hat{\beta}_1 &= \hat{\tilde{\beta}}_1 - \hat{\beta}_2\hat{\gamma}_1 \\ \leftrightarrow (\text{Short Model Estimator:})\quad\hat{\tilde{\beta}}_1 &= \hat{\beta}_1 + \hat{\beta}_2\hat{\gamma}_1 \\ &= Truth + Bias \end{align}\]

Translating what we wrote above to math,

  • we argued that \(\hat{\gamma}_1>0\) and \(\hat{\beta}_2>0\),

  • and so the estimator in the short model is upward biased.

Bottom line: adding IQ to the estimation will decrease the coefficient on educ.
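As a sanity check, we can verify the decomposition \(\hat{\tilde{\beta}}_1 = \hat{\beta}_1 + \hat{\beta}_2\hat{\gamma}_1\) directly in the data. This is a minimal sketch of the two-variable case (ignoring pexp and pexp2); the OVB algebra is an exact in-sample identity, so the two numbers should agree up to rounding:

# short model: log wage on educ only
b1_short = coef(lm(log_wage ~ educ, data = df))["educ"]
# long model: log wage on educ and iq
long = coef(lm(log_wage ~ educ + iq, data = df))
# covariates model: iq on educ
g1 = coef(lm(iq ~ educ, data = df))["educ"]
# short-model slope vs. long-model slope plus the bias term
c(b1_short, long["educ"] + long["iq"] * g1)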

More controls

We continue by adding father's education.

summary(lm(log_wage ~ educ + pexp + pexp2 + iq + feduc, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  4.769617360 0.2352397408
educ         0.054716879 0.0088027999
pexp         0.058818673 0.0255789941
pexp2       -0.001155708 0.0009027647
iq           0.005783902 0.0011411345
feduc        0.013049787 0.0047694996
  • Compare to previous estimators. What changed?

  • What does this mean for feduc as an omitted variable (OV)?

Now add mother's education.

summary(lm(log_wage ~ educ + pexp + pexp2 + iq + feduc + meduc, data=df))$coefficients[,1:2]
                Estimate   Std. Error
(Intercept)  4.718291292 0.2420661714
educ         0.054332388 0.0089634235
pexp         0.064914445 0.0265187182
pexp2       -0.001421293 0.0009426224
iq           0.005469130 0.0011657645
feduc        0.009249090 0.0055402097
meduc        0.008844003 0.0063002757
  • What happened to the slope of educ?

  • What happened to the statistical significance of feduc?

  • If we want an unbiased model, which specification would you choose?

  • Dor: remember, a short discussion on the trade-off of adding more variables (bad controls, variance).

Q3

Bias of the slope

  • True model is long model \(prod = \beta_0 + \beta_1training + \beta_2 ability + u\)

  • Assumption: \(cor(training,ability)<0\)

  • What happens if we estimate the short model \(prod = \alpha_0 + \alpha_1training + v\)?

Can you guess the sign of the bias? What if:

  1. ability positively correlated with productivity, but

  2. ability is negatively correlated with training?

Then \(\hat{\alpha}_1\) is downward-biased, w.r.t \(\beta_1\).

Let's show this explicitly.

  1. From our assumption: \[cor(training,ability)<0\rightarrow cov(training,ability)<0\rightarrow\frac{cov(training,ability)}{var(training)}<0\]

  2. Hence, when estimating \(ability = \gamma_0 + \gamma_1 training + e\), the estimator \(\hat{\gamma}_1 < 0\).

  3. From the OLS formula for the short model, \[\hat{\alpha}_1 = \frac{cov(training,prod)}{var(training)}\]

  4. From the OLS formula for the long model, \[\hat{\beta}_1 = \frac{cov(training,prod)}{var(training)} - \hat{\beta}_2\frac{cov(training,ability)}{var(training)}\]

  5. Combining, we get \[\hat{\beta}_1 = \hat{\alpha}_1 - \hat{\beta}_2\hat{\gamma}_1\leftrightarrow \hat{\alpha}_1 = \hat{\beta}_1 + \hat{\beta}_2\hat{\gamma}_1\]

  6. To sign the bias, all we need are the signs of \(\hat{\beta}_2\) and \(\hat{\gamma}_1\). Since we argued that \(\hat{\gamma}_1<0\) and \(\hat{\beta}_2>0\), we have \(\hat{\alpha}_1 = \hat{\beta}_1 + \text{something negative}\). Hence this estimator is downward-biased.

Bias of the intercept

Let's look at \(\hat{\alpha}_0\) directly, by plugging in the formulas.

  1. OLS formula \[\hat{\alpha}_0 = \overline{prod} - \hat{\alpha}_1 \overline{training}\] \[ \hat{\beta}_0 = \overline{prod} - \hat{\beta}_1 \overline{training} - \hat{\beta}_2\overline{ability}\] \[\hat{\gamma}_0 = \overline{ability} - \hat{\gamma}_1 \overline{training}\]

  2. Substitute the long-model expression for \(\overline{prod}\) into the short-model intercept formula:

\[\hat{\alpha}_0 = \hat{\beta}_0 + \hat{\beta}_1\overline{training} + \hat{\beta}_2\overline{ability} - \hat{\alpha}_1 \overline{training}\]

  3. Substitute the OVB formula for the slope that we found above \[\begin{align*} \hat{\alpha}_{0} & =\hat{\beta}_{0}+\hat{\beta}_{1}\overline{training}+\hat{\beta}_{2}\overline{ability}-\hat{\alpha}_{1}\overline{training}\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\overline{ability}-\left(\hat{\alpha}_{1}-\hat{\beta}_{1}\right)\overline{training}\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\overline{ability}-\left(\hat{\beta}_{2}\hat{\gamma}_{1}\right)\overline{training}\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\left(\overline{ability}-\hat{\gamma}_{1}\overline{training}\right) \end{align*}\]

  4. Substitute the intercept estimator from the covariate regression \[\begin{align*} \hat{\alpha}_{0} & =\hat{\beta}_{0}+\hat{\beta}_{2}\left(\overline{ability}-\hat{\gamma}_{1}\overline{training}\right)\\ & =\hat{\beta}_{0}+\hat{\beta}_{2}\hat{\gamma}_{0} \end{align*}\]
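A quick way to convince yourself of both decompositions (slope and intercept) is a small simulation. This is a minimal sketch; the data-generating process below (training negatively correlated with ability, both raising productivity) is made up purely for illustration:

set.seed(1)
n = 10000
ability  = rnorm(n)
training = -0.5 * ability + rnorm(n)        # cor(training, ability) < 0
prod     = 1 + 2 * training + 3 * ability + rnorm(n)

short = coef(lm(prod ~ training))            # alpha_0, alpha_1
long  = coef(lm(prod ~ training + ability))  # beta_0, beta_1, beta_2
aux   = coef(lm(ability ~ training))         # gamma_0, gamma_1

# slope: alpha_1 = beta_1 + beta_2 * gamma_1 (downward-biased since gamma_1 < 0)
c(short["training"], long["training"] + long["ability"] * aux["training"])
# intercept: alpha_0 = beta_0 + beta_2 * gamma_0
c(short["(Intercept)"], long["(Intercept)"] + long["ability"] * aux["(Intercept)"])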

Q4

# load data
df = read.csv(glue::glue("{insert_data_path}/wage1.csv"))

# correlation between variable and its square
cor(df$exper,df$expersq)
[1] 0.9609709

Regression between two explanatory variables

What do we expect the sign of the regression slope between the two to be?

  • If positive correlation,

  • then positive covariance,

  • and so sign of OLS estimator (covariance / variance) will be positive.

Check:

summary(lm(exper ~ expersq, data=df))$coefficients[,1]
(Intercept)     expersq 
 6.99388211  0.02117127 

Regression when omitting one of these variables

What will happen if we regress log(wages) on experience, without experience squared?

  • Let's try without computing!

  • We saw that experience squared is positively correlated with experience.

  • We can guess that experience squared is negatively correlated with earnings.

  • And so we get a downward bias here!

Check:

summary(lm(lwage ~ exper, data=df)) |> broom::tidy() |> select(term, estimate)
# A tibble: 2 × 2
  term        estimate
  <chr>          <dbl>
1 (Intercept)  1.55   
2 exper        0.00436
summary(lm(lwage ~ exper + expersq, data=df)) |> broom::tidy() |> select(term, estimate)
# A tibble: 3 × 2
  term         estimate
  <chr>           <dbl>
1 (Intercept)  1.30    
2 exper        0.0455  
3 expersq     -0.000944

And voilà! After controlling for expersq, the estimated coefficient on exper increased.

  • I.e., it was biased downward before controlling for its square.
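As a check, the OVB decomposition should reproduce the short-model slope exactly. This is a minimal sketch; here the auxiliary regression is of the omitted variable (expersq) on the included one (exper):

# short and long models
short = coef(lm(lwage ~ exper, data = df))
long  = coef(lm(lwage ~ exper + expersq, data = df))
# auxiliary regression: omitted variable on included variable
gamma = coef(lm(expersq ~ exper, data = df))

# short-model slope vs. long-model slope plus the bias term
c(short["exper"], long["exper"] + long["expersq"] * gamma["exper"])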

Q5

Consider two regressions of wages on tenure,

  • once with a first order polynomial: \(wage = \beta_0 + \beta_1 Tenure + \epsilon\)

  • and once with a second order polynomial: \(wage = \alpha_0 + \alpha_1 Tenure + \alpha_2 Tenure^2 + u\)

Proposition A

Proposition. If the model with a first order polynomial is true, then \(E[\hat{\alpha}_2]=0\).

Proof (True).

If the first-order model is true,

  • It must be that the population parameter \(\alpha_2=0\),

  • And that \(\mathbb{E}[\epsilon \mid Tenure] = 0\).

This implies that, if we estimate the second model, we get

  • Denote \(e = \alpha_2 Tenure^2 + u\). We can re-write the second model as
\[\begin{align} wage &= \alpha_0 + \alpha_1 Tenure + \underset{e}{\underbrace{\alpha_2 Tenure^2 + u}} \\ &= \alpha_0 + \alpha_1 Tenure + e \end{align}\]
  • In the model with \(\alpha_2 Tenure^2 + u=e\) in the error term, it holds that \[E[\epsilon\mid Tenure]=0\leftrightarrow E[e\mid Tenure]=0\leftrightarrow E[\alpha_2 Tenure^2 + u\mid Tenure]=0\]

  • From \(\alpha_2=0\) it follows that \[ E[\alpha_2 Tenure^2 + u \mid Tenure]=E[u\mid Tenure]=0\].

Hence,

  • The OLS estimators in the second order model are unbiased, and

  • In particular, this holds for the coefficient on the squared term, i.e. \(E[\hat{\alpha}_2]=\alpha_2=0\).
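A minimal Monte Carlo sketch of this claim (the data-generating process is made up for illustration): if the true model is first order, the estimated coefficient on the squared term should average out to roughly zero across simulated samples.

set.seed(3)
alpha2_hat = replicate(1000, {
  tenure = runif(100, min = 0, max = 20)
  wage   = 2 + 0.3 * tenure + rnorm(100)   # true model is first order
  coef(lm(wage ~ tenure + I(tenure^2)))["I(tenure^2)"]
})
mean(alpha2_hat)   # should be close to 0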

Proposition B

Proposition. If the model with a second order polynomial is true, with \(\alpha_1>0\) and \(\alpha_2<0\), then \(E[\hat{\beta}_1]\neq\alpha_1\), but we cannot sign the bias of \(\hat{\beta}_1\) w.r.t \(\alpha_1\).

Proof (Not true).

Tenure squared is positively correlated with tenure (\(\hat{\gamma}_1>0\)) and, by assumption, enters wages with a negative coefficient (\(\alpha_2<0\)), so we can sign the bias as negative. I.e., \(\hat{\beta}_1\) is downward-biased. We did this many times above, so we continue without showing the equations.

Proposition C

Proposition. The second order model with all variables in logs is estimable.

Proof (Not true).

If all variables are in logs, then since \(log(Tenure^2)=2log(Tenure)\), the model exhibits perfect multicollinearity, and so we cannot estimate it. In R, the regression would simply omit one of the variables.
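A minimal illustration with simulated data (made up; tenure is drawn strictly positive so the logs are defined) of R dropping the redundant regressor:

set.seed(2)
tenure = runif(200, min = 1, max = 20)
wage   = 5 + 0.5 * log(tenure) + rnorm(200)

# log(tenure^2) = 2 * log(tenure), so the second regressor is dropped (NA)
coef(lm(wage ~ log(tenure) + I(log(tenure^2))))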

Q6

Outline of article

Notation: Subjective happiness := \(SH\), some individual := \(i\)

Model: \(SH_i = \alpha + \beta income_i + u_i\)

Finding: \(\hat{\beta}>0\)

OVB

What's missing?

  • Family background, education, surroundings, within-family correlations, health, …

  • What's the sign of the bias? Remember, it needs to be defined w.r.t some OV.

  • E.g., let's stick with education.

    • Let's assume more educated people have higher income,

    • But, since ignorance is bliss, are less happy subjectively.

  • What's the sign of the bias?

  • From all of the above, negative (i.e., a downward bias).

Controlling

  • Is it easy to control for all of these, and other relevant, variables?

  • No!

  • So even if they had controlled for OV, can we give a causal interpretation to the OLS estimator?

  • No, but it is probably better than having no controls at all.

  • Discussion: so where have we got so far?

An experiment

  • Discussion: so how can we causally estimate the effect of income on happiness?

  • A thought experiment:

    • Let's say we take a random sample of people and divide them into two groups,

    • Ask people how happy they are,

    • And then give one of these groups money,

    • And ask how happy they are afterwards.

  • Is this good enough?

    • What could go wrong?

    • Lab vs. natural setting (external validity?)

    • Reports vs. actual happiness (measurement error?)

  • Is there a possible natural experiment, where we can observe actual happiness?

    • Discuss questions that are answerable vs. not.