\(~\)
\(~\)
\(~\)
# Loading the required libraries
library(haven)
library(tidyverse)
library(tidyr)
library(stargazer)
library(xtable)
library(ggplot2)
library(kableExtra)
# Loading the dataset
data <- read_dta("/Users/bastienpatras/Desktop/Sciences Po - Master in Economics/Econometrics/Pbset 3/ee2002ext.dta")
data_1 <- data %>% na.omit() %>%
select(salfr, ddipl1, s, agd) %>%
rename(Income = salfr, Education = ddipl1, Gender = s, Age = agd) %>%
mutate(Log_Income = log(Income),
Age_squared = Age^2) %>%
filter(Log_Income > quantile(Log_Income, probs = 0.005),
Log_Income < quantile(Log_Income, probs = 0.995),
Education != 7)
head(data_1) %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T)
| Income | Education | Gender | Age | Log_Income | Age_squared |
|---|---|---|---|---|---|
| 5523 | 3 | 2 | 47 | 8.616677 | 2209 |
| 7500 | 5 | 2 | 28 | 8.922658 | 784 |
| 14000 | 3 | 1 | 56 | 9.546813 | 3136 |
| 10800 | 2 | 1 | 43 | 9.287301 | 1849 |
| 7800 | 3 | 2 | 42 | 8.961879 | 1764 |
| 8800 | 3 | 1 | 52 | 9.082507 | 2704 |
\(~\)
\(~\)
# Linear regression : log(salfr) on gender (variable s), age, age squared and ddipl1
model_1 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared + as.factor(Education), data = data_1)
# Output
summary(model_1)
##
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared +
## as.factor(Education), data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6508 -0.1705 0.0518 0.2604 1.7863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.565e+00 4.751e-02 159.23 <2e-16 ***
## as.factor(Gender)2 -3.340e-01 6.717e-03 -49.73 <2e-16 ***
## Age 5.362e-02 2.443e-03 21.95 <2e-16 ***
## Age_squared -5.147e-04 3.031e-05 -16.98 <2e-16 ***
## as.factor(Education)2 2.700e-01 1.399e-02 19.30 <2e-16 ***
## as.factor(Education)3 2.314e-01 9.347e-03 24.76 <2e-16 ***
## as.factor(Education)4 4.149e-01 1.145e-02 36.25 <2e-16 ***
## as.factor(Education)5 6.023e-01 1.166e-02 51.64 <2e-16 ***
## as.factor(Education)6 8.225e-01 1.172e-02 70.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4877 on 21564 degrees of freedom
## Multiple R-squared: 0.2872, Adjusted R-squared: 0.2869
## F-statistic: 1086 on 8 and 21564 DF, p-value: < 2.2e-16
\(~\)
\(~\)
The way to interpret categorical variable coefficients is the following :
\(~\)
For instance, the coefficient on Gender == 2 means that, all else equal, belonging to this category decreases log_salary by 3.340e-01 and therefore multiplies income by \(e^{-0.334} \approx 0.72\), i.e. a decrease of roughly 28%. The interpretation of the coefficients in front of education is the same : for instance, having a middle school degree (coded as 2) rather than the reference category increases log_salary by 2.700e-01 and multiplies income by \(e^{0.270} \approx 1.31\) (demo in the theoretical part)\(~\)
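The implied percentage effects can be recovered directly from the estimated coefficients ; a small sketch using the coefficient names printed by lm() above :
\(~\)
# Converting log-point coefficients into implied percentage effects : exp(beta) - 1
exp(coef(model_1)["as.factor(Gender)2"]) - 1     # roughly -28% for Gender == 2
exp(coef(model_1)["as.factor(Education)2"]) - 1  # roughly +31% for Education == 2
\(~\)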
The way to interpret numerical variable coefficients is the following :
\(~\)
Because Age enters both linearly and squared, the effect of one more year of age on log_salary is not constant : it equals \(0.0536 - 2 \times 0.000515 \times \text{Age}\), so it is positive at younger ages and shrinks as age increases.\(~\)
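This marginal effect can be evaluated at any age from the two estimated coefficients ; a minimal sketch (the helper age_effect is introduced here only for illustration) :
\(~\)
# Marginal effect of one extra year of age on log income, evaluated at a given age
age_effect <- function(age) {
unname(coef(model_1)["Age"] + 2*coef(model_1)["Age_squared"]*age)
}
age_effect(40)  # about +0.012, i.e. roughly +1.2% at age 40
\(~\)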
The output from the lm function gives us an idea of the statistical significance (statistically different from zero) of each coefficient ; in our case, every variable of the model is statistically significant, which means that they are all useful in explaining log_salary.
\(~\)
The output from the lm function provides us with the \(R^2\) and, more importantly, the \(R^2_{\text{adjusted}} = 0.2869\), which corrects the \(R^2\) for the effect of adding variables to the model. In our case, the adjusted R-squared is quite low : only 28.69% of the variation of log_salary is explained by the model.
\(~\)
Finally, the output from the lm function provides us with the \(F_{\text{stat}}\) statistic and, more importantly, the result of the F-test (testing whether all slope coefficients are jointly equal to 0) on the global significance of the model, \(F_{\text{stat}} = 1086\). As the p-value is extremely small, we can safely conclude that the model is globally significant.
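Both quantities discussed above can also be extracted programmatically from the fitted object ; a small sketch :
\(~\)
# Extracting the adjusted R-squared and the overall F-statistic from the model summary
summary(model_1)$adj.r.squared
summary(model_1)$fstatistic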
\(~\)
\(~\)
As we know that :
\[ \text{SST} = \text{SSE} + \text{SSR} \Longleftrightarrow \sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2 = \sum_{i=1}^{n}\left(\hat{y}_i-\bar{y}\right)^2 + \sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^2 \]
We can easily compute the SSR, as well as N, the number of observations, and K+1, the number of estimated parameters, where K corresponds to the number of independent variables that enter linearly in the model.
\(~\)
# Computing SSR, N and K+1 from data_1 dataset
SRR_table <- data_1 %>%
summarise(SST = sum((Log_Income-mean(Log_Income))^2),
SSE = sum((fitted(model_1)-mean(Log_Income))^2),
SSR = SST - SSE,
N = length(Log_Income),
'K+1' = 8+1)
# Output
SRR_table %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T) %>%
column_spec(3, color = "red")
| SST | SSE | SSR | N | K+1 |
|---|---|---|---|---|
| 7196.053 | 2066.75 | 5129.303 | 21573 | 9 |
\(~\)
\(~\)
We want to test the hypothesis that there is no effect of education. Since education enters the model through five dummy variables, this amounts to testing that all of their coefficients are jointly equal to zero :
\[ \begin{cases} H_0 : \beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})} = \dots = \beta_{(\text{Education == 6})} = 0 \\ H_1 : \text{at least one } \beta_{(\text{Education == j})} \neq 0 \end{cases} \]
\(~\)
As the model includes one dummy per type of degree obtained, the individual significance of each education level is given by the t-statistics in the summary computed above : each \(Pr(>|t|)\) indicates that the corresponding coefficient is statistically different from zero. The joint hypothesis that all education coefficients are zero is tested with the F-test further below.
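A quick way to carry out this joint test in R is a nested-model comparison : re-estimate the model without the education dummies and compare it to model_1 with anova(). This is only a sketch of the approach that is computed by hand further below (the name model_no_educ is introduced here purely for illustration) :
\(~\)
# Joint F-test that all education coefficients are zero :
# compare the model without education (restricted) to model_1 (unrestricted)
model_no_educ <- lm(Log_Income ~ as.factor(Gender) + Age + Age_squared, data = data_1)
anova(model_no_educ, model_1)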
\(~\)
\(~\)
We want to test the hypothesis that the effect of having a high school degree is the same as the effect of having a middle school degree, therefore we can set up the following bilateral test :
\(~\)
\[ \begin{cases} H_0 : \beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})} \\ H_1 : \beta_{(\text{Education == 2})} \neq \beta_{(\text{Education == 3})} \end{cases} \Longleftrightarrow \begin{cases} H_0 : \beta_{(\text{Education == 2})} -\beta_{(\text{Education == 3})} = 0 \\ H_1 : \beta_{(\text{Education == 2})} - \beta_{(\text{Education == 3})} \neq 0 \end{cases}\]
\(~\)
The statistic of this test corresponds to :
\(~\)
\[ T = \frac{\left(\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right) - \left(\beta_{(\text{Education == 2})}-\beta_{(\text{Education == 3})}\right)}{\hat{\sigma}\left[\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right]} \overset{H_0}{=} \frac{\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}}{\hat{\sigma}\left[\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right]} \]
\(~\)
# Extracting the coefficients of interest
beta_Educ_2 <- model_1$coef[5]
beta_Educ_3 <- model_1$coef[6]
\(~\)
We can then compute the value of the standard error we need :
\[ \sigma(\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}) = \sqrt{\text{Var}(\hat{\beta}_{(\text{Education == 2})}) + \text{Var}(\hat{\beta}_{(\text{Education == 3})})-2\times \text{Cov}(\hat{\beta}_{(\text{Education == 2})},\hat{\beta}_{(\text{Education == 3})})} \]
\(~\)
# Extracting the corresponding variances
var_beta_Educ_2 <- vcov(model_1)[5,5]
var_beta_Educ_3 <- vcov(model_1)[6,6]
# Extracting the corresponding covariances
covar_beta_Educ_2_Educ_3 <- vcov(model_1)[5,6]
# Computing sigma, the standard error of the difference of the two coefficients :
sigma_Educ_2_Educ_3 <- sqrt(var_beta_Educ_2 + var_beta_Educ_3 - 2*covar_beta_Educ_2_Educ_3)
\(~\)
The decision rule corresponding to this test is the following :
\(~\)
\[\begin{cases} |T| > t_{1-\frac{\alpha}{2}} & \text{: We reject $H_0$ at $\alpha = 0.05$} \\ |T| \leq t_{1-\frac{\alpha}{2}} & \text{: We do not reject $H_0$ at $\alpha = 0.05$} \end{cases}\]
\(~\)
# Setting the level of alpha
alpha_level <- 0.05
# Computing the t-statistic : difference of the two coefficients over its standard error
t_stat <- (beta_Educ_2 - beta_Educ_3)/sigma_Educ_2_Educ_3
# Critical value from the Student distribution
t_crit <- qt(1-alpha_level/2, df.residual(model_1))
# Decision rule (two-sided test)
if (abs(t_stat) < t_crit) {
print("Do not reject H_0 at alpha = 0.05")
} else {
print("Reject H_0 at alpha = 0.05")
}
## [1] "Do not reject H_0 at alpha = 0.05"
\(~\)
\(~\)
Under the assumption that \(\beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})}\), we can factor a single common coefficient in front of \(\text{Education == 2}\) and \(\text{Education == 3}\) in the linear model, i.e. replace the two dummies by their sum.
\(~\)
# Transform the dataset with the new variable
data_2 <- data_1 %>%
mutate(Education_1 = if_else(Education == 1, 1, 0),
Education_2 = if_else(Education == 2, 1, 0),
Education_3 = if_else(Education == 3, 1, 0),
Education_4 = if_else(Education == 4, 1, 0),
Education_5 = if_else(Education == 5, 1, 0),
Education_6 = if_else(Education == 6, 1, 0),
Education_3_2 = Education_3+Education_2)
# Running the model with the transformed dataset
model_2 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared +
as.factor(Education_3_2) +
as.factor(Education_4) +
as.factor(Education_5) +
as.factor(Education_6),
data = data_2)
# Output
summary(model_2)
##
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared +
## as.factor(Education_3_2) + as.factor(Education_4) + as.factor(Education_5) +
## as.factor(Education_6), data = data_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6500 -0.1705 0.0520 0.2608 1.7847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.566e+00 4.752e-02 159.23 <2e-16 ***
## as.factor(Gender)2 -3.329e-01 6.707e-03 -49.64 <2e-16 ***
## Age 5.350e-02 2.443e-03 21.90 <2e-16 ***
## Age_squared -5.128e-04 3.031e-05 -16.92 <2e-16 ***
## as.factor(Education_3_2)1 2.395e-01 8.893e-03 26.93 <2e-16 ***
## as.factor(Education_4)1 4.150e-01 1.145e-02 36.25 <2e-16 ***
## as.factor(Education_5)1 6.024e-01 1.166e-02 51.64 <2e-16 ***
## as.factor(Education_6)1 8.226e-01 1.172e-02 70.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4878 on 21565 degrees of freedom
## Multiple R-squared: 0.2869, Adjusted R-squared: 0.2867
## F-statistic: 1240 on 7 and 21565 DF, p-value: < 2.2e-16
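As a cross-check (a minimal sketch, not part of the estimation above), the single restriction \(\beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})}\) can also be tested by comparing the restricted and unrestricted fits with a nested-model F-test :
\(~\)
# F-test of the single restriction beta_Educ_2 = beta_Educ_3 :
# model_2 (restricted) is nested in model_1 (unrestricted), both fitted on the same observations
anova(model_2, model_1)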
\(~\)
\(~\)
# Linear regression : log(salfr) on gender (variable s), age and age squared
model_3 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared, data = data_1)
# Output
summary(model_3)
##
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared,
## data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1712 -0.2241 0.0163 0.2929 1.8426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.973e+00 5.312e-02 150.08 <2e-16 ***
## as.factor(Gender)2 -2.825e-01 7.525e-03 -37.53 <2e-16 ***
## Age 5.392e-02 2.758e-03 19.55 <2e-16 ***
## Age_squared -5.831e-04 3.421e-05 -17.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5521 on 21569 degrees of freedom
## Multiple R-squared: 0.08648, Adjusted R-squared: 0.08635
## F-statistic: 680.6 on 3 and 21569 DF, p-value: < 2.2e-16
\(~\)
\(~\)
# Constrained model (no education) : log(salfr) on gender (variable s), age and age squared
model_3 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared, data = data_1)
# Computing the sums of squares of the unrestricted (model_1) and constrained (model_3) models,
# together with q, the number of restrictions imposed by the constrained model :
SRR_table_2 <- data_1 %>%
summarise(SST = sum((Log_Income-mean(Log_Income))^2),
SSE = sum((fitted(model_1)-mean(Log_Income))^2),
SSE_cm = sum((fitted(model_3)-mean(Log_Income))^2),
SSR = SST - SSE,
SSR_cm = SST - SSE_cm,
N = length(Log_Income),
q = 9-4)
SRR_table_2 %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T) %>%
column_spec(5, color = "red")
| SST | SSE | SSE_cm | SSR | SSR_cm | N | q |
|---|---|---|---|---|---|---|
| 7196.053 | 2066.75 | 622.3263 | 5129.303 | 6573.727 | 21573 | 5 |
\(~\)
\(~\)
The Fisher statistic for testing the constrained model (without education) against the unrestricted one is given by :
\[ F_{\text{stat}} = \frac{\left(\text{SSR}_{\text{cm}} - \text{SSR}\right)/q}{\text{SSR}/\left(N-(K+1)\right)} \]
where \(q = 5\) is the number of restrictions (the five education coefficients set to zero) and \(K+1 = 9\) is the number of parameters of the unrestricted model.
# Creating the Fisher statistic for the test (q = 5 restrictions, N - 9 residual degrees of freedom)
F_stat <- SRR_table_2 %>%
summarise(F_stat = ((SSR_cm-SSR)/5)/(SSR/(N-9)))
F_stat %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T) %>%
column_spec(1, color = "red")
| F_stat |
|---|
| 1214.5 |
\(~\)
Our computed F-statistic differs from the one reported by the lm() output : the F-statistic in summary(model_1) tests the joint significance of all the regressors (gender, age, age squared and the education dummies), whereas the statistic computed here tests only the five education coefficients, so the two statistics and their degrees of freedom are not the same.
\(~\)
\(~\)
# Computing the theoretical F critical value (q = 5, denominator df = N - 9)
q_F <- qf(0.95, df1 = 5, df2 = df.residual(model_1))
# Decision rule for the global significance test
if (q_F > F_stat$F_stat) {
print("Do not reject H_0 at alpha = 0.05, thus Education has no effect on Log_Income")
} else {
print("Reject H_0 at alpha = 0.05, thus Education has an effect on Log_Income")
}
## [1] "Reject H_0 at alpha = 0.05, thus Education has an effect on Log_Income"
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)