Undergrad Metrics - PS7

Author

Dor Leventer

Published

December 1, 2022

Q1

Load the cost data and create log variables

# packages: dplyr for %>% and mutate(), ggplot2 for the plots further below
library(dplyr)
library(ggplot2)

insert_data_path <- "/Users/dorleventer/Dropbox/teaching/undergrad_econometrics_spring_2023/data"
df = read.csv(glue::glue("{insert_data_path}/production.csv"))[-1] %>%
  mutate(
    log_pf = log(pf),
    log_pk = log(pk),
    log_pl = log(pl),
    log_y = log(y),
    log_c = log(cost)
  )

df |> tibble::as_tibble()
# A tibble: 99 × 11
     obs  cost    pf    pk    pl     y log_pf log_pk log_pl log_y  log_c
   <int> <dbl> <dbl> <dbl> <dbl> <int>  <dbl>  <dbl>  <dbl> <dbl>  <dbl>
 1     1 0.213  18    64.9 6869.     8   2.89   4.17   8.83  2.08 -1.55 
 2     2 0.489  34.2  86.1 5439.    14   3.53   4.46   8.60  2.64 -0.716
 3     3 0.616  32.1  90.5 9204.    50   3.47   4.51   9.13  3.91 -0.485
 4     4 0.761  28.5  41.2 8972.    65   3.35   3.72   9.10  4.17 -0.274
 5     5 0.636  25.4  58.3 6696.    67   3.23   4.06   8.81  4.20 -0.452
 6     6 1.15   21.5  79.1 7190.    90   3.07   4.37   8.88  4.50  0.137
 7     7 1.34   35.5  74.4 5063.   183   3.57   4.31   8.53  5.21  0.294
 8     8 2.26   39.2  71.9 8218.   295   3.67   4.28   9.01  5.69  0.815
 9     9 2.05   26.3  82.5 7885.   374   3.27   4.41   8.97  5.92  0.719
10    10 3.15   42.5  60.3 7895.   378   3.75   4.10   8.97  5.93  1.15 
# … with 89 more rows

Q1.A

Original equation.

\[C=BY^{\beta_Y}P_L^{\beta_L}P_K^{\beta_K}P_F^{\beta_F}e^\omega\]

Consider \(\log(C)\): \[\begin{align} \log(C) &= \log(BY^{\beta_Y}P_L^{\beta_L}P_K^{\beta_K}P_F^{\beta_F}e^\omega) \\ &= \log(B) + \log(Y^{\beta_Y}) + \log(P_L^{\beta_L}) + \log(P_K^{\beta_K}) + \log(P_F^{\beta_F}) + \log(e^\omega) \\ &= \log(B) + \beta_Y\log(Y) + \beta_L\log(P_L) + \beta_K\log(P_K) + \beta_F\log(P_F) + \omega \end{align}\] where the last line uses \(\log(e^\omega)=\omega\).

We end up with an equation that is linear in the parameters, so we can estimate it by OLS:

summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))

Call:
lm(formula = log_c ~ log_y + log_pl + log_pk + log_pf, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.49179 -0.12897 -0.00924  0.09748  0.86660 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -8.54875    1.39458  -6.130    0.000000020478882 ***
log_y        0.82410    0.01309  62.981 < 0.0000000000000002 ***
log_pl       0.17789    0.14256   1.248                0.215    
log_pk       0.20286    0.14457   1.403                0.164    
log_pf       0.69325    0.08071   8.590    0.000000000000182 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.212 on 94 degrees of freedom
Multiple R-squared:  0.9798,    Adjusted R-squared:  0.979 
F-statistic:  1141 on 4 and 94 DF,  p-value: < 0.00000000000000022

Q1.B

Consider the production function (where \(F\) denotes fuel)

\[Y=AL^{\alpha_L}K^{\alpha_K}F^{\alpha_F}e^u\] We are told that \[\sum_j\alpha_j = r = 1/\beta_Y\]

Q1.B.1

Does the production function display constant returns to scale? It does if and only if \(r=1\), which is equivalent to \(\beta_Y=1\).

Reminder: the procedure is the same as in the single-variable case. The hypothesis is \[H_0:\beta_Y = 1\] which we test using the t statistic \[\frac{\hat{\beta}_Y-1}{\hat\sigma_{\hat{\beta}_Y}}=t\]

Test:

# Manually:
beta_y = summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))$coefficients[2,1]
sigma_beta_y = summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))$coefficients[2,2]
t = (beta_y-1)/sigma_beta_y
t
[1] -13.44271
# Very large in absolute value, so we reject H0.

# How do we get the p-value / significance?
degrees_freedom = summary(lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df))$df[2]
pt(t, degrees_freedom)
[1] 0.000000000000000000000006486226
# very small p-value, so we reject H0 that beta_y is equal to 1.
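# Note: pt(t, degrees_freedom) is the one-sided (left-tail) probability; the
# two-sided p-value for H0: beta_y = 1 doubles the tail area (still essentially zero):
2*pt(-abs(t), degrees_freedom)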

# How to test automatically?
mod = lm(log_c ~ log_y + log_pl + log_pk + log_pf, data = df)
test = car::linearHypothesis(model = mod, "log_y = 1")
test
Linear hypothesis test

Hypothesis:
log_y = 1

Model 1: restricted model
Model 2: log_c ~ log_y + log_pl + log_pk + log_pf

  Res.Df     RSS Df Sum of Sq      F                Pr(>F)    
1     95 12.3427                                              
2     94  4.2235  1    8.1192 180.71 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# we got an F statistic; with a single constraint, |t| is its square root
sqrt(test$F[2])
[1] 13.44271

Q1.B.2

Is \(Cost(Y,P_L,P_K,P_F)\) a homogeneous function in the prices of degree 1?

This is true if

\[Cost(Y,hP_L,hP_K,hP_F) = h^1\,Cost(Y,P_L,P_K,P_F)\]

Expanding the left-hand side: \[\begin{align} Cost(Y,hP_L,hP_K,hP_F) &= BY^{\beta_Y}(hP_L)^{\beta_L}(hP_K)^{\beta_K}(hP_F)^{\beta_F}e^\omega \\ &= h^{\beta_L}h^{\beta_K}h^{\beta_F}BY^{\beta_Y}P_L^{\beta_L}P_K^{\beta_K}P_F^{\beta_F}e^\omega \\ &= h^{\sum_{i\in\{L,K,F\}}\beta_i}Cost(Y,P_L,P_K,P_F) \end{align}\]

So for the above to hold, it needs to be the case that \[\sum_{i\in\{L,K,F\}}\beta_i = 1\]

What hypothesis is this equivalent to? \[H_0: \sum_{i\in\{L,K,F\}}\beta_i - 1 = 0\]

Q1.B.2 - t-test

  • Calculation procedure in slides

Write as

\[t = \frac{\hat{\beta}_L+\hat{\beta}_K+\hat{\beta}_F - 1}{\sqrt{Var(\hat{\beta}_L+\hat{\beta}_K+\hat{\beta}_F)}}\]

What's in the denominator?

\[Var(\hat{\beta}_L+\hat{\beta}_K+\hat{\beta}_F - 1) = Var(\hat{\beta}_L) + Var(\hat{\beta}_K) + Var(\hat{\beta}_F) + 2\left(Cov(\hat{\beta}_L,\hat{\beta}_K) + Cov(\hat{\beta}_L,\hat{\beta}_F) + Cov(\hat{\beta}_K,\hat{\beta}_F)\right)\]

Let's estimate this.

## begin with manual calculation

# coefficients estimation
beta_L = mod$coefficients[[3]] # double brackets drop the coefficient's name and keep only its value
beta_K = mod$coefficients[[4]]
beta_F = mod$coefficients[[5]]
# variance covariance estimation
cov_mat <- vcov(mod)[-1, -1] # drop the intercept row and column
cov_mat
                log_y        log_pl        log_pk         log_pf
log_y   0.00017121763 -0.0002824858 -0.0003755597 -0.00005281364
log_pl -0.00028248577  0.0203236223  0.0020192218 -0.00348960591
log_pk -0.00037555967  0.0020192218  0.0209011076 -0.00105158318
log_pf -0.00005281364 -0.0034896059 -0.0010515832  0.00651369276
sigma_LL = cov_mat[2,2]
sigma_KK = cov_mat[3,3]
sigma_FF = cov_mat[4,4]
sigma_LK = cov_mat[2,3]
sigma_LF = cov_mat[2,4]
sigma_KF = cov_mat[3,4]
# now we can calculate the t-statistic for the hypothesis
t = (beta_L + beta_K + beta_F - 1)/((sigma_LL + sigma_KK + sigma_FF +
                                       2*(sigma_LK + sigma_LF + sigma_KF)))^0.5
t
[1] 0.3581413
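# Equivalent calculation via the variance of a linear combination (a quick check):
# with weight vector a = (0, 1, 1, 1) over (log_y, log_pl, log_pk, log_pf),
# Var(a'beta_hat) = a' %*% cov_mat %*% a
a = matrix(c(0, 1, 1, 1), ncol = 1)
(beta_L + beta_K + beta_F - 1)/sqrt(t(a) %*% cov_mat %*% a) # same t as above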
## Now the automatic version: calculate an F-statistic and take its square root
test = car::linearHypothesis(model = mod, "log_pl + log_pk + log_pf = 1")
test
Linear hypothesis test

Hypothesis:
log_pl  + log_pk  + log_pf = 1

Model 1: restricted model
Model 2: log_c ~ log_y + log_pl + log_pk + log_pf

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1     95 4.2292                           
2     94 4.2235  1  0.005763 0.1283  0.721
sqrt(test$F[2])
[1] 0.3581413

Q1.B.2 - F-test

We can write the hypothesis as \[\beta_F = 1-\beta_L-\beta_K\]

Insert into model to get the restricted model (denote with \(R\)) \[\begin{align} log(Cost) &= \log(B) + \beta_Y\log(Y) + \beta_L\log(P_L) + \beta_K\log(P_K) + (1-\beta_L-\beta_K)\log(P_F) + \omega \\ &= \log(B) + \beta_Y\log(Y) + \beta_L[\log(P_L)-\log(P_F)] + \beta_K[\log(P_K)-\log(P_F)] + 1\log(P_F) + \omega \end{align}\]

For the restricted model (denote with \(R\)) and the un-restricted model (denote with \(U\)), calculate the sums of squared errors \[ESS^R = \sum (\hat{\omega}^R)^2\quad\&\quad ESS^U = \sum (\hat{\omega}^U)^2\] and the corresponding F-statistic \[F=\frac{(ESS^R-ESS^U)/d}{ESS^U/(n-k-1)}\sim F_{d,n-k-1}\] where

  • \(d\) is the number of constraints, and
  • \(k\) is the number of regressors, so \(n-k-1\) is the residual degrees of freedom

Calculate F-statistic for this hypothesis

# un-restricted model and ess
mod_u = mod
ess_u = sum(mod_u$residuals^2)

# now do a restricted model estimation
df = df %>%
  mutate(
    log_c_minus_log_pf = log_c - log_pf,
    log_pl_minus_log_pf = log_pl - log_pf,
    log_pk_minus_log_pf = log_pk - log_pf
  )

mod_r = lm(log_c_minus_log_pf ~ log_y + log_pl_minus_log_pf + log_pk_minus_log_pf, data = df)
ess_r = sum(mod_r$residuals^2)

# parameters
d = 1 # 1 constraint
n = nrow(df) # number of obs
k = length(mod$coefficients) - 1 # number of variables is all coefficients minus intercept

F_stat = ((ess_r - ess_u)/d)/(ess_u/(n-k-1))
F_stat
[1] 0.1282652

We got a very low F-statistic. We can plot it against its reference distribution:

f_dist = data.frame(dist = rf(n=n, df1=d, df2=(n-k-1)))
ggplot(data = f_dist, aes(x=dist)) + 
  geom_density(color="blue") + 
  geom_vline(xintercept = F_stat, color = "red", linetype = "dashed") + 
  theme_classic() + 
  scale_x_continuous(limits = c(0,10)) + 
  labs(x="Value",y="Density")
Warning: Removed 1 rows containing non-finite values (`stat_density()`).

Or ask what significance level, or p-value, this corresponds to:

1 - pf(F_stat, d, (n-k-1))
[1] 0.7210405

That is, we cannot reject \(H_0\) at conventional statistical levels.

Looking back at the linearHypothesis output above, we can see the same F-statistic there.

Q1.C

Now we have the following hypothesis: \[H_0: \sum_{i\in\{L,K,F\}}\beta_i = 1 \quad \& \quad \beta_Y = 1\] It follows from before that the restricted model is \[\begin{align} log(Cost) &= \log(B) + 1\log(Y) + \beta_L\log(P_L) + \beta_K\log(P_K) + (1-\beta_L-\beta_K)\log(P_F) + \omega \\ &= \log(B) + 1\log(Y) + \beta_L[\log(P_L)-\log(P_F)] + \beta_K[\log(P_K)-\log(P_F)] + 1\log(P_F) + \omega \end{align}\]

Calculate F-statistic:

car::linearHypothesis(model = mod, c("log_pl + log_pk + log_pf = 1",
                                     "log_y = 1"))
Linear hypothesis test

Hypothesis:
log_pl  + log_pk  + log_pf = 1
log_y = 1

Model 1: restricted model
Model 2: log_c ~ log_y + log_pl + log_pk + log_pf

  Res.Df     RSS Df Sum of Sq      F                Pr(>F)    
1     96 12.8296                                              
2     94  4.2235  2    8.6062 95.772 < 0.00000000000000022 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
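As in Q1.B.2, we can also compute this F-statistic by hand from the restricted model's ESS. Under both constraints, the restricted regression is of \(\log(C)-\log(Y)-\log(P_F)\) on the two price differences; a sketch, reusing ess_u, n and k from above (it should reproduce the F-statistic reported by linearHypothesis):

# restricted model under beta_Y = 1 and beta_L + beta_K + beta_F = 1
mod_r_joint = lm(I(log_c - log_y - log_pf) ~ log_pl_minus_log_pf + log_pk_minus_log_pf, data = df)
ess_r_joint = sum(mod_r_joint$residuals^2)
d = 2 # two constraints this time
((ess_r_joint - ess_u)/d)/(ess_u/(n - k - 1))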

Q1.D

Predict at the means of the explanatory variables:

df_pred = data.frame(
  log_pl = mean(df$log_pl),
  log_pk = mean(df$log_pk),
  log_pf = mean(df$log_pf),
  log_y = mean(df$log_y)
)

predict(mod, df_pred)
       1 
3.065995 
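The same number can be computed by hand: it is just the estimated equation evaluated at the sample means (a quick check of what predict() does here):

sum(coef(mod) * c(1, mean(df$log_y), mean(df$log_pl), mean(df$log_pk), mean(df$log_pf)))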

Q2

Q2.i

Test the hypothesis that each coefficient equals zero separately; the t statistic is the estimate divided by its standard error:

# Intermediate products
0.56/0.5
[1] 1.12
# Labor
0.4/0.65
[1] 0.6153846
# Capital 
0.18/0.2
[1] 0.9
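For reference, the two-sided 5% critical value of the \(t\) distribution with \(n-k-1 = 38-3-1 = 34\) degrees of freedom (using \(n=38\) from the problem) is about 2.03, so none of these t statistics is individually significant:

qt(0.975, df = 38 - 3 - 1)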

Q2.ii

Now test the hypothesis that all three coefficients are jointly zero.

We can use the F-statistic for this test:

\[F = \frac{R^2_U(n-k-1)}{(1-R^2_U)k} \sim F_{k,n - k - 1}\]

(0.75*(38-3-1))/((1-0.75)*3)
[1] 34

with distribution \(F_{3,34}\), this is significant with p-value

F_stat = (0.75*(38-3-1))/((1-0.75)*3)
d = 3; n = 38; k = 3
1 - pf(F_stat, d, (n-k-1))
[1] 0.000000000241855
f_dist = data.frame(dist = rf(n=n, df1=d, df2=(n-k-1)))
ggplot(data = f_dist, aes(x=dist)) + 
  geom_density(color="blue") + 
  geom_vline(xintercept = F_stat, color = "red", linetype = "dashed") + 
  theme_classic() + 
  scale_x_continuous(limits = c(0,40)) + 
  labs(x="Value",y="Density")

Proof that this is the standard F statistic. First note that

\[ R^2 = 1 - ESS/TSS \leftrightarrow ESS/TSS = 1-R^2 \]

So using the standard F statistic and some algebra

\[\begin{align} F &= \frac{(ESS^R-ESS^U)/d}{ESS^U/(n-k-1)}\\ &= \frac{\frac{1}{TSS}(ESS^R-ESS^U)/d}{\frac{1}{TSS}ESS^U/(n-k-1)}\\ &= \frac{(\frac{ESS^R}{TSS}-\frac{ESS^U}{TSS})/d}{\frac{ESS^U}{TSS}/(n-k-1)}\\ &= \frac{(1-R^2_R-1+R^2_U)/d}{(1-R^2_U)/(n-k-1)} \end{align}\]

Finally, note that when the hypothesis is that all slope coefficients are zero, the restricted model is an intercept-only model, so by construction \(R^2_R = 0\). Thus

\[\begin{align} F &= \frac{(1-1+R^2_U)/d}{(1-R^2_U)/(n-k-1)}\\ &= \frac{R^2_U/d}{(1-R^2_U)/(n-k-1)}\\ &= \frac{R^2_U \times (n-k-1)}{(1-R^2_U)\times d}\\ \end{align}\] and since all \(k\) slope coefficients are restricted here, \(d=k\), which gives the formula used above.
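A quick numerical check of this formula, using the cost model mod from Q1 (\(k=4\), \(n=99\)): the \(R^2\)-based expression should reproduce the overall F-statistic reported by summary() (about 1141).

smry = summary(mod)
r2_u = smry$r.squared
n = nrow(df); k = length(mod$coefficients) - 1
c(from_r2 = (r2_u*(n - k - 1))/((1 - r2_u)*k),
  reported = unname(smry$fstatistic["value"]))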

Q2.iii

These are different hypotheses! Jointly, the regressors explain the outcome well. But because they are correlated with one another (multicollinearity), none of them is individually significant.

Q3

We are asked to perform a Chow test.

  • Reminder: say we have a model and two sub-populations.
  • We estimate the model in each sub-population separately.
  • We get two sets of estimators.
  • Now we want to ask: are these sets equivalent?
  • This is the Chow test.

We begin with a formal statement of the hypothesis.

  • Let there be two groups, \(P\) and \(G\).
  • Estimate the production function \(y = b_0 + b_zz+b_ll+b_kk+u\), separately by group.
  • Denote by \(b_j^g\) the estimator of \(b_j\) when estimated using group \(g\).
  • This produces two sets of estimators: \(\{\hat{b}_0^P,\hat{b}_z^P,\hat{b}_l^P,\hat{b}_k^P\}\) and \(\{\hat{b}_0^G,\hat{b}_z^G,\hat{b}_l^G,\hat{b}_k^G\}\)
  • We want to test whether \(P\) and \(G\) have the same production function, so the hypothesis is \[H_0:b_i^P=b_i^G,\quad \forall i\in\{0,z,l,k\}\] The test for this is the following F-test: \[F=\frac{\left[ESS_R-(ESS_P+ESS_G)\right]/(k+1)}{(ESS_P+ESS_G)/(N_G+N_P-2k-2)}\sim F_{k+1,N_G+N_P-2k-2}\] Let's calculate this from the table in the problem set.
# start with tss
tss_p = 308
tss_g = 240
tss_r = 650
# continue with R^2
r_sqr_p = .75
r_sqr_g = .66
r_sqr_r = .7

Note that \(R^2 = 1-ESS/TSS\leftrightarrow ESS = TSS\times(1-R^2)\)

# so calculate ess
ess_p = tss_p * (1-r_sqr_p)
ess_g = tss_g * (1-r_sqr_g)
ess_r = tss_r * (1-r_sqr_r)
# input other parameters
n_p = 38
n_g = 18
k = 3
# calculate F-statistic
F_stat = ((ess_r - (ess_g + ess_p))/(k+1))/((ess_p + ess_g)/(n_g + n_p - 2*k - 2))
F_stat
[1] 2.754098
# calculate p-value
1 - pf(F_stat, (k+1), (n_g + n_p - 2*k - 2))
[1] 0.03847922
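As in the earlier questions, we can place this statistic on a simulated draw from its reference \(F_{k+1,\,N_G+N_P-2k-2}\) distribution (same plotting approach as above):

f_dist = data.frame(dist = rf(n = 1000, df1 = (k+1), df2 = (n_g + n_p - 2*k - 2)))
ggplot(data = f_dist, aes(x = dist)) + 
  geom_density(color = "blue") + 
  geom_vline(xintercept = F_stat, color = "red", linetype = "dashed") + 
  theme_classic() + 
  scale_x_continuous(limits = c(0, 10)) + 
  labs(x = "Value", y = "Density")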

So we can reject \(H_0\) at the 5% level: \(P\) and \(G\) do not have the same production function.

Q4

Assume \(F=4\) and \(Var(\hat{\beta}_2)=1\).

  • Un-restricted model: \[Y = \beta_0 + \beta_1X_1 + \beta_2X_2 + u\]
  • Restricted model: \[Y - X_2 = \beta_0 + \beta_1X_1 + u\]
  • This is equivalent to \[Y = \beta_0 + \beta_1X_1 + 1\times X_2 + u\]
  • Which is equivalent to the hypothesis: \[H_0:\beta_2 = 1\].

Since \(d=1\) (the number of constraints), \(F=t^2\), and so \[t=\frac{\hat{\beta}_2-1}{\sqrt{Var(\hat{\beta}_2)}}=\hat{\beta}_2-1=\pm\sqrt{4}=\pm2\] Hence \(\hat{\beta}_2\in\{-1,3\}\).
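A minimal numeric sketch of this argument in R (nothing here uses data; we simply plug in \(F=4\) and \(Var(\hat{\beta}_2)=1\)):

F_stat = 4
var_beta2 = 1
t_abs = sqrt(F_stat) # |t| = sqrt(F) when there is a single constraint
1 + c(-1, 1) * t_abs * sqrt(var_beta2) # the two possible values of beta_2_hat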

Q5

Load the NBA data

df <- read.csv(glue::glue("{insert_data_path}/nbasalraw.csv"))[-1]
df |> tibble::as_tibble()
# A tibble: 269 × 23
     obs   age agesq assists allstar avgmin black center children  coll draft
   <int> <int> <int>   <dbl>   <int>  <dbl> <int>  <int>    <int> <int> <int>
 1     1    27   729   4.5         0  37.2      1      0        0     4    19
 2     2    28   784   8.80        0  35.8      1      0        1     4    28
 3     3    25   625   0.200       0  15.5      1      1        0     4    19
 4     4    28   784   1.5         0  25.1      1      0        0     4     1
 5     5    24   576   2.60        0  25.6      1      0        0     4    24
 6     6    31   961   1.5         0  24.0      1      0        0     4     4
 7     7    28   784   1.40        0  28.8      0      0        0     0    40
 8     8    27   729   0.700       0  16.9      1      0        0     3    47
 9     9    25   625   2           0   9.03     1      0        0     4    NA
10    10    35  1225   2.30        1  36.5      1      0        1     3     3
# … with 259 more rows, and 12 more variables: exper <int>, forward <int>,
#   expersq <int>, games <int>, guard <int>, lwage <dbl>, marr <int>,
#   marrblck <int>, minutes <int>, points <dbl>, rebounds <dbl>, wage <dbl>

Q5.1

Estimate model

summary(lm(points ~ exper + expersq + age + coll, data = df))

Call:
lm(formula = points ~ exper + expersq + age + coll, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.4736  -3.9849  -0.8246   3.6536  20.9790 

Coefficients:
            Estimate Std. Error t value     Pr(>|t|)    
(Intercept) 35.21831    6.98673   5.041 0.0000008621 ***
exper        2.36363    0.40550   5.829 0.0000000162 ***
expersq     -0.07703    0.02348  -3.280     0.001177 ** 
age         -1.07396    0.29507  -3.640     0.000328 ***
coll        -1.28625    0.45059  -2.855     0.004651 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.509 on 264 degrees of freedom
Multiple R-squared:  0.1412,    Adjusted R-squared:  0.1282 
F-statistic: 10.85 on 4 and 264 DF,  p-value: 0.00000003687

Q5.2

We are asked at what level of experience points are maximized.

  • From the last tutorial (tirgul), the marginal effect of experience is \[\frac{\partial points}{\partial exper} = \beta_{exper} + 2\beta_{expersq}exper\]
  • which is maximized where \[\beta_{exper} + 2\beta_{expersq}exper = 0 \leftrightarrow exper = \frac{-\beta_{exper}}{2\beta_{expersq}}\]
mod = summary(lm(points ~ exper + expersq + age + coll, data = df))
beta_exper = mod$coefficients[2,1]
beta_expersq = mod$coefficients[3,1]
max_exper = (-beta_exper)/(2*beta_expersq)
max_exper
[1] 15.3429

Is this reasonable?

  • I would have expected the peak to come much sooner…
  • (if a player starts at 20, is his best season really at around 35?)

Q5.3

Why is the estimated coefficient on coll negative?

  • It is possible to be recruited before college.
  • Who is recruited straight out of high school?
  • The best players!
  • This is a possible story for why players recruited from college score less.

Q5.4

Add age squared

mod = summary(lm(points ~ exper + expersq + age + agesq + coll, data = df))
mod

Call:
lm(formula = points ~ exper + expersq + age + agesq + coll, data = df)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2520  -4.0283  -0.8208   3.4792  20.8332 

Coefficients:
            Estimate Std. Error t value   Pr(>|t|)    
(Intercept) 73.59034   35.93341   2.048    0.04156 *  
exper        2.86383    0.61272   4.674 0.00000472 ***
expersq     -0.12807    0.05244  -2.442    0.01525 *  
age         -3.98369    2.68908  -1.481    0.13969    
agesq        0.05355    0.04919   1.089    0.27732    
coll        -1.31260    0.45108  -2.910    0.00392 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.508 on 263 degrees of freedom
Multiple R-squared:  0.1451,    Adjusted R-squared:  0.1288 
F-statistic: 8.925 on 5 and 263 DF,  p-value: 0.00000007615

The coefficient on age squared isn't significantly different from zero. So why add it?

beta_exper = mod$coefficients[2,1]
beta_expersq = mod$coefficients[3,1]
max_exper = (-beta_exper)/(2*beta_expersq)
max_exper
[1] 11.18051

We get a slightly more reasonable estimate of the experience level at which points peak, once age squared is accounted for.

Q5.5

Predict log earnings.

mod = lm(lwage ~ points + exper + expersq + age + coll, data = df)
mod

Call:
lm(formula = lwage ~ points + exper + expersq + age + coll, data = df)

Coefficients:
(Intercept)       points        exper      expersq          age         coll  
   6.779039     0.077730     0.217845    -0.007082    -0.048137    -0.040271  
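Q5.6 below refers to the individual t-tests on age and coll, which we can pull from the model summary:

summary(mod)$coefficients[c("age", "coll"), ]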

Q5.6

Do age and coll matter?

  • Looking at the separate t-tests, no. We cannot reject that each coefficient is equal to zero.
  • Are they jointly different from zero? \[H_0:\beta_{age}=\beta_{coll} = 0\]
car::linearHypothesis(model = mod,
                      c("age = 0", "coll = 0"))
Linear hypothesis test

Hypothesis:
age = 0
coll = 0

Model 1: restricted model
Model 2: lwage ~ points + exper + expersq + age + coll

  Res.Df    RSS Df Sum of Sq      F Pr(>F)
1    265 107.59                           
2    263 106.63  2   0.96416 1.1891 0.3061

The p-value of the F-statistic for the joint hypothesis is about 0.31, so we cannot reject \(H_0\) at conventional levels (we would reject only at \(\alpha\) above roughly 31%).

  • So no: age and coll are not jointly significant once experience (and its square) and points scored are taken into account.