Undergrad Metrics - PS7

Author

Dor Leventer

Published

December 1, 2022

Q1

Load the cost data and create log variables

# packages: dplyr for %>% and mutate(), ggplot2 for the plots further below
library(dplyr)
library(ggplot2)

insert_data_path <- "/Users/dorleventer/Dropbox/teaching/undergrad_econometrics_spring_2023/data"
df = read.csv(glue::glue("{insert_data_path}/production.csv"))[-1] %>%
  mutate(
    log_pf = log(pf),
    log_pk = log(pk),
    log_pl = log(pl),
    log_y = log(y),
    log_c = log(cost)
  )

df |> tibble::as_tibble()
# A tibble: 99 × 11
     obs  cost    pf    pk    pl     y log_pf log_pk log_pl log_y  log_c
   <int> <dbl> <dbl> <dbl> <dbl> <int>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
 1     1 0.213  18    64.9 6869.     8   2.89   4.17   8.83  2.08 -1.55 
 2     2 0.489  34.2  86.1 5439.    14   3.53   4.46   8.60  2.64 -0.716
 3     3 0.616  32.1  90.5 9204.    50   3.47   4.51   9.13  3.91 -0.485
 4     4 0.761  28.5  41.2 8972.    65   3.35   3.72   9.10  4.17 -0.274
 5     5 0.636  25.4  58.3 6696.    67   3.23   4.06   8.81  4.20 -0.452
 6     6 1.15   21.5  79.1 7190.    90   3.07   4.37   8.88  4.50  0.137
 7     7 1.34   35.5  74.4 5063.   183   3.57   4.31   8.53  5.21  0.294
 8     8 2.26   39.2  71.9 8218.   295   3.67   4.28   9.01  5.69  0.815
 9     9 2.05   26.3  82.5 7885.   374   3.27   4.41   8.97  5.92  0.719
10    10 3.15   42.5  60.3 7895.   378   3.75   4.10   8.97  5.93  1.15 
# … with 89 more rows

Q1.A

Original equation.

\[C=BY^{\beta_Y}P_L^{\beta_L}P_K^{\beta_K}P_F^{\beta_F}e^\omega\]

Consider \(\log(C)\): \[\begin{align} \log(C) &= \log(BY^{\beta_Y}P_L^{\beta_L}P_K^{\beta_K}P_F^{\beta_F}e^\omega) \\ &= \log(B) + \log(Y^{\beta_Y}) + \log(P_L^{\beta_L}) + \log(P_K^{\beta_K}) + \log(P_F^{\beta_F}) + \log(e^\omega) \\ &= \log(B) + \beta_Y\log(Y) + \beta_L\log(P_L) + \beta_K\log(P_K) + \beta_F\log(P_F) + \omega \end{align}\] where the last line uses \(\log(e^\omega)=\omega\).

We end up with an equation that is linear in the parameters, so we can estimate it by OLS:

summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))

Call:
lm(formula = log_c ~ log_y + log_pl + log_pk + log_pf, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.49179 -0.12897 -0.00924  0.09748  0.86660 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -8.54875    1.39458  -6.130    0.000000020478882 ***
log_y        0.82410    0.01309  62.981 < 0.0000000000000002 ***
log_pl       0.17789    0.14256   1.248                0.215    
log_pk       0.20286    0.14457   1.403                0.164    
log_pf       0.69325    0.08071   8.590    0.000000000000182 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.212 on 94 degrees of freedom
Multiple R-squared:  0.9798,    Adjusted R-squared:  0.979 
F-statistic:  1141 on 4 and 94 DF,  p-value: < 0.00000000000000022

Q1.B

Consider the production function (where \(F\) denotes fuel)

\[Y=AL^{\alpha_L}K^{\alpha_K}F^{\alpha_F}e^u\] We are told that \[\sum_j\alpha_j = r = 1/\beta_Y\]

Q1.B.1

Does the production function display constant returns to scale? It does if and only if \(r=1\), which is equivalent to \(\beta_Y=1\).

Reminder: the procedure is the same as in the single-variable case. The hypothesis is \[H_0:\beta_Y = 1\] which we test using the t statistic \[\frac{\hat{\beta}_Y-1}{\hat\sigma_{\hat{\beta}_Y}}=t\]

Test:

# Manually:
beta_y = summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))$coefficients[2,1]
sigma_beta_y = summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))$coefficients[2,2]
t = (beta_y-1)/sigma_beta_y
t
[1] -13.44271
# Very large in absolute value, so we reject H0.

# How do we get the p-value / significance?
degrees_freedom = summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))$df[2]
pt(t, degrees_freedom)
[1] 0.000000000000000000000006486226
# very small p-value, so we reject H0 that beta_y is equal to 1.
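# Note: pt(t, degrees_freedom) is the one-sided (left-tail) probability; the
# two-sided p-value for H0: beta_y = 1 doubles the tail area (still essentially zero):
2*pt(-abs(t), degrees_freedom)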

# How to test automatically?
mod = lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df)
test = car::linearHypothesis(model = mod, "log_y = 1")
test
Linear hypothesis test

Hypothesis:
log_y = 1

Model 1: restricted model
Model 2: log_c ~ log_y + log_pl + log_pk + log_pf

  Res.Df     RSS Df Sum of Sq      F                Pr(>F)    
1     95 12.3427                                              
2     94  4.2235  1    8.1192 180.71 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# we got an F statistic; with a single constraint, |t| is its square root
sqrt(test$F[2])
[1] 13.44271

Q1.B.2

Is \(Cost(Y,P_L,P_K,P_F)\) a homogeneous function in the prices of degree 1?

This is true if

\[Cost(Y,hP_L,hP_K,hP_F) = h^1\,Cost(Y,P_L,P_K,P_F)\]

Expanding the left-hand side: \[\begin{align} Cost(Y,hP_L,hP_K,hP_F) &= BY^{\beta_Y}(hP_L)^{\beta_L}(hP_K)^{\beta_K}(hP_F)^{\beta_F}e^\omega \\ &= h^{\beta_L}h^{\beta_K}h^{\beta_F}BY^{\beta_Y}P_L^{\beta_L}P_K^{\beta_K}P_F^{\beta_F}e^\omega \\ &= h^{\sum_{i\in\{L,K,F\}}\beta_i}Cost(Y,P_L,P_K,P_F) \end{align}\]

So for the above to hold, it needs to be the case that \[\sum_{i\in\{L,K,F\}}\beta_i = 1\]

What hypothesis is this equivalent to? \[H_0: \sum_{i\in\{L,K,F\}}\beta_i - 1 = 0\]

Q1.B.2 - t-test

  • Calculation procedure in slides

Write as

\[t = \frac{\hat{\beta}_L+\hat{\beta}_K+\hat{\beta}_F - 1}{\sqrt{Var(\hat{\beta}_L+\hat{\beta}_K+\hat{\beta}_F)}}\]

What's in the denominator?

\[Var(\hat{\beta}_L+\hat{\beta}_K+\hat{\beta}_F - 1) = Var(\hat{\beta}_L) + Var(\hat{\beta}_K) + Var(\hat{\beta}_F) + 2\left(Cov(\hat{\beta}_L,\hat{\beta}_K) + Cov(\hat{\beta}_L,\hat{\beta}_F) + Cov(\hat{\beta}_K,\hat{\beta}_F)\right)\]

Let's estimate this.

## begin with manual calculation

# coefficients estimation
beta_L = mod$coefficients[[3]] # double brackets drop the coefficient's name and keep only its value
beta_K = mod$coefficients[[4]]
beta_F = mod$coefficients[[5]]
# variance covariance estimation
cov_mat <- vcov(mod)[-1, -1] # drop the intercept row and column
cov_mat
                log_y        log_pl        log_pk         log_pf
log_y   0.00017121763 -0.0002824858 -0.0003755597 -0.00005281364
log_pl -0.00028248577  0.0203236223  0.0020192218 -0.00348960591
log_pk -0.00037555967  0.0020192218  0.0209011076 -0.00105158318
log_pf -0.00005281364 -0.0034896059 -0.0010515832  0.00651369276
sigma_LL = cov_mat[2,2]
sigma_KK = cov_mat[3,3]
sigma_FF = cov_mat[4,4]
sigma_LK = cov_mat[2,3]
sigma_LF = cov_mat[2,4]
sigma_KF = cov_mat[3,4]
# now we can calculate the t-statistic for the hypothesis
t = (beta_L + beta_K + beta_F - 1)/((sigma_LL + sigma_KK + sigma_FF +
                                       2*(sigma_LK + sigma_LF + sigma_KF)))^0.5
t
[1] 0.3581413
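# Equivalent calculation via the variance of a linear combination (a quick check):
# with weight vector a = (0, 1, 1, 1) over (log_y, log_pl, log_pk, log_pf),
# Var(a'beta_hat) = a' %*% cov_mat %*% a
a = matrix(c(0, 1, 1, 1), ncol = 1)
(beta_L + beta_K + beta_F - 1)/sqrt(t(a) %*% cov_mat %*% a) # same t as above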
## Now the automatic version: calculate an F-statistic and take its square root
test = car::linearHypothesis(model = mod, "log_pl + log_pk + log_pf = 1")
test
Linear hypothesis test

Hypothesis:
log_pl  + log_pk  + log_pf = 1

Model 1: restricted model
Model 2: log_c ~ log_y + log_pl + log_pk + log_pf

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     95 4.2292                           
2     94 4.2235  1  0.005763 0.1283  0.721
sqrt(test$F[2])
[1] 0.3581413

Q1.B.2 - F-test

We can write the hypothesis as \[\beta_F = 1-\beta_L-\beta_K\]

Insert into model to get the restricted model (denote with \(R\)) \[\begin{align} log(Cost) &= \log(B) + \beta_Y\log(Y) + \beta_L\log(P_L) + \beta_K\log(P_K) + (1-\beta_L-\beta_K)\log(P_F) + \omega \\ &= \log(B) + \beta_Y\log(Y) + \beta_L[\log(P_L)-\log(P_F)] + \beta_K[\log(P_K)-\log(P_F)] + 1\log(P_F) + \omega \end{align}\]

For the restricted model (denote with \(R\)) and the un-restricted model (denote with \(U\)), calculate the sums of squared errors \[ESS^R = \sum (\hat{\omega}^R)^2\quad\&\quad ESS^U = \sum (\hat{\omega}^U)^2\] and the corresponding F-statistic \[F=\frac{(ESS^R-ESS^U)/d}{ESS^U/(n-k-1)}\sim F_{d,n-k-1}\] where

  • \(d\) is the number of constraints, and
  • \(k\) is the number of regressors, so \(n-k-1\) is the residual degrees of freedom

Calculate F-statistic for this hypothesis

# un-restricted model and ess
mod_u = mod
ess_u = sum(mod_u$residuals^2)

# now do a restricted model estimation
df = df %>%
  mutate(
    log_c_minus_log_pf = log_c - log_pf,
    log_pl_minus_log_pf = log_pl - log_pf,
    log_pk_minus_log_pf = log_pk - log_pf
  )

mod_r = lm(log_c_minus_log_pf ~ log_y + log_pl_minus_log_pf + log_pk_minus_log_pf, data = df)
ess_r = sum(mod_r$residuals^2)

# parameters
d = 1 # 1 constraint
n = nrow(df) # number of obs
k = length(mod$coefficients) - 1 # number of variables is all coefficients minus intercept

F_stat = ((ess_r - ess_u)/d)/(ess_u/(n-k-1))
F_stat
[1] 0.1282652

We got a very low F-statistic. We can plot it against its reference distribution:

f_dist = data.frame(dist = rf(n=n, df1=d, df2=(n-k-1)))
ggplot(data = f_dist, aes(x=dist)) + 
  geom_density(color="blue") + 
  geom_vline(xintercept = F_stat, color = "red", linetype = "dashed") + 
  theme_classic() + 
  scale_x_continuous(limits = c(0,10)) + 
  labs(x="Value",y="Density")
Warning: Removed 1 rows containing non-finite values (`stat_density()`).

Or ask what significance level, or p-value, this corresponds to:

1 - pf(F_stat, d, (n-k-1))
[1] 0.7210405

That is, we cannot reject \(H_0\) at conventional statistical levels.

Looking back at the linearHypothesis output above, we can see the same F-statistic there.

Q1.C

Now we have the following hypothesis: \[H_0: \sum_{i\in\{L,K,F\}}\beta_i = 1 \quad \& \quad \beta_Y = 1\] It follows from before that the restricted model is \[\begin{align} log(Cost) &= \log(B) + 1\log(Y) + \beta_L\log(P_L) + \beta_K\log(P_K) + (1-\beta_L-\beta_K)\log(P_F) + \omega \\ &= \log(B) + 1\log(Y) + \beta_L[\log(P_L)-\log(P_F)] + \beta_K[\log(P_K)-\log(P_F)] + 1\log(P_F) + \omega \end{align}\]

Calculate F-statistic:

car::linearHypothesis(model = mod, c("log_pl + log_pk + log_pf = 1",
                                     "log_y = 1"))
Linear hypothesis test

Hypothesis:
log_pl  + log_pk  + log_pf = 1
log_y = 1

Model 1: restricted model
Model 2: log_c ~ log_y + log_pl + log_pk + log_pf

  Res.Df     RSS Df Sum of Sq      F                Pr(>F)    
1     96 12.8296                                              
2     94  4.2235  2    8.6062 95.772 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
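As in Q1.B.2, we can also compute this F-statistic by hand from the restricted model's ESS. Under both constraints, the restricted regression is of \(\log(C)-\log(Y)-\log(P_F)\) on the two price differences; a sketch, reusing ess_u, n and k from above (it should reproduce the F-statistic reported by linearHypothesis):

# restricted model under beta_Y = 1 and beta_L + beta_K + beta_F = 1
mod_r_joint = lm(I(log_c - log_y - log_pf) ~ log_pl_minus_log_pf + log_pk_minus_log_pf, data = df)
ess_r_joint = sum(mod_r_joint$residuals^2)
d = 2 # two constraints this time
((ess_r_joint - ess_u)/d)/(ess_u/(n - k - 1))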

Q1.D

Predict at the means of the explanatory variables:

df_pred = data.frame(
  log_pl = mean(df$log_pl),
  log_pk = mean(df$log_pk),
  log_pf = mean(df$log_pf),
  log_y = mean(df$log_y)
)

predict(mod, df_pred)
       1 
3.065995 
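The same number can be computed by hand: it is just the estimated equation evaluated at the sample means (a quick check of what predict() does here):

sum(coef(mod) * c(1, mean(df$log_y), mean(df$log_pl), mean(df$log_pk), mean(df$log_pf)))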

Q2

Q2.i

Test the hypothesis that each coefficient equals zero separately; the t statistic is the estimate divided by its standard error:

# Intermediate products
0.56/0.5
[1] 1.12
# Labor
0.4/0.65
[1] 0.6153846
# Capital 
0.18/0.2
[1] 0.9
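For reference, the two-sided 5% critical value of the \(t\) distribution with \(n-k-1 = 38-3-1 = 34\) degrees of freedom (using \(n=38\) from the problem) is about 2.03, so none of these t statistics is individually significant:

qt(0.975, df = 38 - 3 - 1)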

Q2.ii

Now test the hypothesis that all three coefficients are jointly zero.

We can use the F-statistic for this test:

\[F = \frac{R^2_U(n-k-1)}{(1-R^2_U)k} \sim F_{k,n - k - 1}\]

(0.75*(38-3-1))/((1-0.75)*3)
[1] 34

with distribution \(F_{3,34}\), this is significant with p-value

F_stat = (0.75*(38-3-1))/((1-0.75)*3)
d = 3; n = 38; k = 3
1 - pf(F_stat, d, (n-k-1))
[1] 0.000000000241855
f_dist = data.frame(dist = rf(n=n, df1=d, df2=(n-k-1)))
ggplot(data = f_dist, aes(x=dist)) + 
  geom_density(color="blue") + 
  geom_vline(xintercept = F_stat, color = "red", linetype = "dashed") + 
  theme_classic() + 
  scale_x_continuous(limits = c(0,40)) + 
  labs(x="Value",y="Density")

Proof that this is the standard F statistic. First note that

\[ R^2 = 1 - ESS/TSS \leftrightarrow ESS/TSS = 1-R^2 \]

So using the standard F statistic and some algebra

\[\begin{align} F &= \frac{(ESS^R-ESS^U)/d}{ESS^U/(n-k-1)}\\ &= \frac{\frac{1}{TSS}(ESS^R-ESS^U)/d}{\frac{1}{TSS}ESS^U/(n-k-1)}\\ &= \frac{(\frac{ESS^R}{TSS}-\frac{ESS^U}{TSS})/d}{\frac{ESS^U}{TSS}/(n-k-1)}\\ &= \frac{(1-R^2_R-1+R^2_U)/d}{(1-R^2_U)/(n-k-1)} \end{align}\]

Finally, note that when the hypothesis is that all slope coefficients are zero, the restricted model is an intercept-only model, so by construction \(R^2_R = 0\). Thus

\[\begin{align} F &= \frac{(1-1+R^2_U)/d}{(1-R^2_U)/(n-k-1)}\\ &= \frac{R^2_U/d}{(1-R^2_U)/(n-k-1)}\\ &= \frac{R^2_U \times (n-k-1)}{(1-R^2_U)\times d}\\ \end{align}\] and since all \(k\) slope coefficients are restricted here, \(d=k\), which gives the formula used above.
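A quick numerical check of this formula, using the cost model mod from Q1 (\(k=4\), \(n=99\)): the \(R^2\)-based expression should reproduce the overall F-statistic reported by summary() (about 1141).

smry = summary(mod)
r2_u = smry$r.squared
n = nrow(df); k = length(mod$coefficients) - 1
c(from_r2 = (r2_u*(n - k - 1))/((1 - r2_u)*k),
  reported = unname(smry$fstatistic["value"]))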

Q2.iii

These are different hypotheses! Jointly, the regressors explain the outcome well. But because they are correlated with one another (multicollinearity), none of them is individually significant.

Q3

We are asked to perform a Chow test.

  • Reminder: say we have a model and two sub-populations.
  • We estimate the model in each sub-population separately.
  • We get two sets of estimators.
  • Now we want to ask: are these sets equivalent?
  • This is the Chow test.

We begin with a formal statement of the hypothesis.

  • Let there be two groups, \(P\) and \(G\).
  • Estimate the production function \(y = b_0 + b_zz+b_ll+b_kk+u\), separately by group.
  • Denote by \(b_j^g\) the estimator of \(b_j\) when estimated using group \(g\).
  • This produces two sets of estimators: \(\{\hat{b}_0^P,\hat{b}_z^P,\hat{b}_l^P,\hat{b}_k^P\}\) and \(\{\hat{b}_0^G,\hat{b}_z^G,\hat{b}_l^G,\hat{b}_k^G\}\)
  • We want to test whether \(P\) and \(G\) have the same production function, so the hypothesis is \[H_0:b_i^P=b_i^G,\quad \forall i\in\{0,z,l,k\}\] The test for this is the following F-test: \[F=\frac{\left[ESS_R-(ESS_P+ESS_G)\right]/(k+1)}{(ESS_P+ESS_G)/(N_G+N_P-2k-2)}\sim F_{k+1,N_G+N_P-2k-2}\] Let's calculate this from the table in the problem set.
# start with tss
tss_p = 308
tss_g = 240
tss_r = 650
# continue with R^2
r_sqr_p = .75
r_sqr_g = .66
r_sqr_r = .7

Note that \(R^2 = 1-ESS/TSS\leftrightarrow ESS = TSS\times(1-R^2)\)

# so calculate ess
ess_p = tss_p * (1-r_sqr_p)
ess_g = tss_g * (1-r_sqr_g)
ess_r = tss_r * (1-r_sqr_r)
# input other parameters
n_p = 38
n_g = 18
k = 3
# calculate F-statistic
F_stat = ((ess_r - (ess_g + ess_p))/(k+1))/((ess_p + ess_g)/(n_g + n_p - 2*k - 2))
F_stat
[1] 2.754098
# calculate p-value
1 - pf(F_stat, (k+1), (n_g + n_p - 2*k - 2))
[1] 0.03847922
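As in the earlier questions, we can place this statistic on a simulated draw from its reference \(F_{k+1,\,N_G+N_P-2k-2}\) distribution (same plotting approach as above):

f_dist = data.frame(dist = rf(n = 1000, df1 = (k+1), df2 = (n_g + n_p - 2*k - 2)))
ggplot(data = f_dist, aes(x = dist)) + 
  geom_density(color = "blue") + 
  geom_vline(xintercept = F_stat, color = "red", linetype = "dashed") + 
  theme_classic() + 
  scale_x_continuous(limits = c(0, 10)) + 
  labs(x = "Value", y = "Density")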

So we can reject \(H_0\) at the 5% level: \(P\) and \(G\) do not have the same production function.

Q4

Assume \(F=4\) and \(Var(\hat{\beta}_2)=1\).

  • Un-restricted model: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + u\]
  • Restricted model: \[Y - X_2 = \beta_0 + \beta_1X_1 + u\]
  • This is equivalent to \[Y = \beta_0 + \beta_1X_1 + 1\times X_2 + u\]
  • Which is equivalent to the hypothesis: \[H_0:\beta_2 = 1\].

Since \(d=1\) (the number of constraints), \(F=t^2\), and so \[t=\frac{\hat{\beta}_2-1}{\sqrt{Var(\hat{\beta}_2)}}=\hat{\beta}_2-1=\pm\sqrt{4}=\pm2\] Hence \(\hat{\beta}_2\in\{-1,3\}\).
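A minimal numeric sketch of this argument in R (nothing here uses data; we simply plug in \(F=4\) and \(Var(\hat{\beta}_2)=1\)):

F_stat = 4
var_beta2 = 1
t_abs = sqrt(F_stat) # |t| = sqrt(F) when there is a single constraint
1 + c(-1, 1) * t_abs * sqrt(var_beta2) # the two possible values of beta_2_hat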

Q5

Load the NBA data

df <- read.csv(glue::glue("{insert_data_path}/nbasalraw.csv"))[-1]
df |> tibble::as_tibble()
# A tibble: 269 × 23
     obs   age agesq assists allstar avgmin black center children  coll draft
   <int> <int> <int>   <dbl>   <int>  <dbl> <int>  <int>    <int> <int> <int>
 1     1    27   729   4.5         0  37.2      1      0        0     4    19
 2     2    28   784   8.80        0  35.8      1      0        1     4    28
 3     3    25   625   0.200       0  15.5      1      1        0     4    19
 4     4    28   784   1.5         0  25.1      1      0        0     4     1
 5     5    24   576   2.60        0  25.6      1      0        0     4    24
 6     6    31   961   1.5         0  24.0      1      0        0     4     4
 7     7    28   784   1.40        0  28.8      0      0        0     0    40
 8     8    27   729   0.700       0  16.9      1      0        0     3    47
 9     9    25   625   2           0   9.03     1      0        0     4    NA
10    10    35  1225   2.30        1  36.5      1      0        1     3     3
# … with 259 more rows, and 12 more variables: exper <int>, forward <int>,
#   expersq <int>, games <int>, guard <int>, lwage <dbl>, marr <int>,
#   marrblck <int>, minutes <int>, points <dbl>, rebounds <dbl>, wage <dbl>

Q5.1

Estimate model

summary(lm(points ~ exper + expersq + age + coll, data = df))

Call:
lm(formula = points ~ exper + expersq + age + coll, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4736  -3.9849  -0.8246   3.6536  20.9790 

Coefficients:
            Estimate Std. Error t value     Pr(>|t|)    
(Intercept) 35.21831    6.98673   5.041 0.0000008621 ***
exper        2.36363    0.40550   5.829 0.0000000162 ***
expersq     -0.07703    0.02348  -3.280     0.001177 ** 
age         -1.07396    0.29507  -3.640     0.000328 ***
coll        -1.28625    0.45059  -2.855     0.004651 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.509 on 264 degrees of freedom
Multiple R-squared:  0.1412,    Adjusted R-squared:  0.1282 
F-statistic: 10.85 on 4 and 264 DF,  p-value: 0.00000003687

Q5.2

We are asked at what level of experience points are maximized.

  • From the last tutorial (tirgul), the marginal effect of experience is \[\frac{\partial points}{\partial exper} = \beta_{exper} + 2\beta_{expersq}exper\]
  • which is maximized where \[\beta_{exper} + 2\beta_{expersq}exper = 0 \leftrightarrow exper = \frac{-\beta_{exper}}{2\beta_{expersq}}\]
mod = summary(lm(points ~ exper + expersq + age + coll, data = df))
beta_exper = mod$coefficients[2,1]
beta_expersq = mod$coefficients[3,1]
max_exper = (-beta_exper)/(2*beta_expersq)
max_exper
[1] 15.3429

Is this reasonable?

  • I would have expected the peak to come much sooner…
  • (if a player starts at 20, is his best season really at around 35?)

Q5.3

Why is the estimated coefficient on coll negative?

  • It is possible to be recruited before college.
  • Who is recruited straight out of high school?
  • The best players!
  • This is a possible story for why players recruited from college score less.

Q5.4

Add age squared

mod = summary(lm(points ~ exper + expersq + age + agesq + coll, data = df))
mod

Call:
lm(formula = points ~ exper + expersq + age + agesq + coll, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2520  -4.0283  -0.8208   3.4792  20.8332 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) 73.59034   35.93341   2.048    0.04156 *  
exper        2.86383    0.61272   4.674 0.00000472 ***
expersq     -0.12807    0.05244  -2.442    0.01525 *  
age         -3.98369    2.68908  -1.481    0.13969    
agesq        0.05355    0.04919   1.089    0.27732    
coll        -1.31260    0.45108  -2.910    0.00392 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.508 on 263 degrees of freedom
Multiple R-squared:  0.1451,    Adjusted R-squared:  0.1288 
F-statistic: 8.925 on 5 and 263 DF,  p-value: 0.00000007615

The coefficient on age squared isn't significantly different from zero. So why add it?

beta_exper = mod$coefficients[2,1]
beta_expersq = mod$coefficients[3,1]
max_exper = (-beta_exper)/(2*beta_expersq)
max_exper
[1] 11.18051

We get a slightly more reasonable estimate of the experience level at which points peak, once age squared is accounted for.

Q5.5

Predict log earnings.

mod = lm(lwage ~ points + exper + expersq + age + coll, data = df)
mod

Call:
lm(formula = lwage ~ points + exper + expersq + age + coll, data = df)

Coefficients:
(Intercept)       points        exper      expersq          age         coll  
   6.779039     0.077730     0.217845    -0.007082    -0.048137    -0.040271  
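Q5.6 below refers to the individual t-tests on age and coll, which we can pull from the model summary:

summary(mod)$coefficients[c("age", "coll"), ]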

Q5.6

Do age and coll matter?

  • Looking at the separate t-tests, no. We cannot reject that each coefficient is equal to zero.
  • Are they jointly different from zero? \[H_0:\beta_{age}=\beta_{coll} = 0\]
car::linearHypothesis(model = mod,
                      c("age = 0", "coll = 0"))
Linear hypothesis test

Hypothesis:
age = 0
coll = 0

Model 1: restricted model
Model 2: lwage ~ points + exper + expersq + age + coll

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    265 107.59                           
2    263 106.63  2   0.96416 1.1891 0.3061

The p-value of the F-statistic for the joint hypothesis is about 0.31, so we cannot reject \(H_0\) at conventional levels (we would reject only at \(\alpha\) above roughly 31%).

  • So no: age and coll are not jointly significant once experience (and its square) and points scored are taken into account.