\(~\)

Online version of the code \(\mid\) https://rpubs.com/Patras/730416

\(~\)

1 \(\mid\) Load STATA dataset ee2002ext.dta

2 \(\mid\) Create the variable log(salfr) and trim the sample below the \(0.5^{th}\) percentile and above the \(99.5^{th}\) one.

2.1 \(\mid\mid\) Remove observations corresponding to individuals still in school (ddipl1=7)

3 \(\mid\) Generate age squared (age is agd)

\(~\)

# Loading the required libraries

library(haven)
library(tidyverse)
library(tidyr)
library(stargazer) 
library(xtable) 
library(ggplot2)
library(kableExtra)

# Loading the dataset

data <- read_dta("/Users/bastienpatras/Desktop/Sciences Po - Master in Economics/Econometrics/Pbset 3/ee2002ext.dta")

data_1 <- data %>% na.omit() %>%
                   select(salfr, ddipl1, s, agd) %>%
                   rename(Income = salfr, Education = ddipl1, Gender = s, Age = agd) %>%
                   mutate(Log_Income = log(Income),
                          Age_squared = Age^2) %>%
                   filter(Log_Income > quantile(Log_Income, probs = 0.005), 
                          Log_Income < quantile(Log_Income, probs = 0.995), 
                          Education!= 7)

head(data_1) %>% kbl(booktabs = T) %>%
             kable_styling(latex_options = "striped", full_width = T) 
Income   Education   Gender   Age   Log_Income   Age_squared
  5523           3        2    47     8.616677          2209
  7500           5        2    28     8.922658           784
 14000           3        1    56     9.546813          3136
 10800           2        1    43     9.287301          1849
  7800           3        2    42     8.961879          1764
  8800           3        1    52     9.082507          2704

\(~\)

4 \(\mid\) Regress log(salfr) on gender (variable s), age, age squared and ddipl1.

\(~\)

# Linear regression : log(salfr) on gender (variable s), age, age squared and ddipl1

model_1 <- lm(Log_Income ~ as.factor(Gender) + Age + 
                Age_squared + as.factor(Education), data = data_1)

# Output

summary(model_1)
## 
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared + 
##     as.factor(Education), data = data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6508 -0.1705  0.0518  0.2604  1.7863 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            7.565e+00  4.751e-02  159.23   <2e-16 ***
## as.factor(Gender)2    -3.340e-01  6.717e-03  -49.73   <2e-16 ***
## Age                    5.362e-02  2.443e-03   21.95   <2e-16 ***
## Age_squared           -5.147e-04  3.031e-05  -16.98   <2e-16 ***
## as.factor(Education)2  2.700e-01  1.399e-02   19.30   <2e-16 ***
## as.factor(Education)3  2.314e-01  9.347e-03   24.76   <2e-16 ***
## as.factor(Education)4  4.149e-01  1.145e-02   36.25   <2e-16 ***
## as.factor(Education)5  6.023e-01  1.166e-02   51.64   <2e-16 ***
## as.factor(Education)6  8.225e-01  1.172e-02   70.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4877 on 21564 degrees of freedom
## Multiple R-squared:  0.2872, Adjusted R-squared:  0.2869 
## F-statistic:  1086 on 8 and 21564 DF,  p-value: < 2.2e-16

\(~\)

5 \(\mid\) Interpret the regression output.

\(~\)

The way to interpret the categorical-variable coefficients is the following: since the dependent variable is the log of salary, the coefficient on a dummy measures the difference in expected log salary with respect to the omitted reference category (men for Gender, ddipl1 = 1 for Education), holding the other regressors fixed. Multiplied by 100 it is approximately a percentage difference in salary; the exact relative difference is \(e^{\hat{\beta}}-1\).

\(~\)

The way to interpret the numerical-variable coefficients is the following: a one-unit increase in the regressor changes expected log salary by its coefficient, i.e. by roughly \(100\times\hat{\beta}\) percent. Because age enters both linearly and squared, the marginal effect of one more year of age is \(\hat{\beta}_{\text{Age}} + 2\,\hat{\beta}_{\text{Age}^2}\times\text{Age}\): positive at younger ages and declining as age rises.

\(~\)
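As a minimal illustration (a sketch using the model_1 object estimated below in question 4; the coefficient names are those printed by summary()), the log-point estimates can be converted into percentage effects:

# Converting log-point coefficients into (approximate and exact) percentage effects
coefs <- coef(model_1)

coefs["as.factor(Gender)2"] * 100             # approximate % difference for women vs men
(exp(coefs["as.factor(Gender)2"]) - 1) * 100  # exact % difference

# Age enters with a quadratic term, so its marginal effect depends on age (here at age 40)
(coefs["Age"] + 2 * coefs["Age_squared"] * 40) * 100

\(~\)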

The output from the lm() function indicates the statistical significance of each coefficient (i.e. whether it is statistically different from zero). In our case every variable in the model is statistically significant, which means that each of them is useful in explaining log salary.

\(~\)

The output from the lm() function also provides the \(R^2\) and, more importantly, the \(R^2_{\text{adjusted}} = 0.2869\), which corrects the \(R^2\) for the effect of adding variables to the model. In our case the adjusted R-squared is fairly low: only 28.69% of the variation of log salary is explained by the model.
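These quantities can be extracted from the fitted object directly (a small sketch based on model_1):

# Extracting R-squared and adjusted R-squared from the fitted model
summary(model_1)$r.squared
summary(model_1)$adj.r.squared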

\(~\)

Finally, the output from the lm() function provides the \(F_{\text{stat}}\) statistic and, more importantly, the result of the F-test of the global significance of the model (testing whether all slope coefficients are jointly equal to 0): \(F_{\text{stat}} = 1086\). As the p-value is extremely small, we can safely conclude that the model is globally significant.
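The p-value of this overall F-test can be recovered from the fitted object (a sketch; summary(model_1)$fstatistic stores the statistic and its degrees of freedom):

# p-value of the overall F-test reported by summary()
f <- summary(model_1)$fstatistic
pf(f["value"], f["numdf"], f["dendf"], lower.tail = FALSE)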

\(~\)

6 \(\mid\) Create global variables N, K + 1 and SSR.

\(~\)

As we know that :

\[ \text{SST} = \text{SSE} + \text{SSR} \Longleftrightarrow \sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2 = \sum_{i=1}^{n}\left(\hat{y}_i-\bar{y}\right)^2 + \sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^2 \]

We can easily compute the SSR, as well as N, the number of observations, and K + 1, the number of estimated parameters (K independent variables entering the model linearly, plus the intercept).

\(~\)

# Computing SSR, N and K+1 from data_1 dataset

SRR_table <- data_1 %>%
             summarise(SST = sum((Log_Income-mean(Log_Income))^2), 
                       SSE = sum((fitted(model_1)-mean(Log_Income))^2),
                       SSR = SST - SSE,
                       N = length(Log_Income),
                       'K+1' = 8+1)

# Output

SRR_table %>% kbl(booktabs = T) %>%
             kable_styling(latex_options = "striped", full_width = T) %>%
             column_spec(3, color = "red")
      SST       SSE        SSR       N   K+1
 7196.053   2066.75   5129.303   21573     9
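As a cross-check (a sketch relying on model_1 as estimated above), R can return the residual sum of squares and the residual degrees of freedom directly:

# Cross-check: SSR and residual degrees of freedom straight from the fitted model
deviance(model_1)      # should match SSR = SST - SSE above
df.residual(model_1)   # N - (K + 1)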

\(~\)

7 \(\mid\) Test the null hypothesis that there is no effect of education

\(~\)

We want to test the hypothesis that education has no effect, so we can set up the following two-sided test:

\[ \begin{cases} H_0 : \beta_{\text{Education}} &= 0 \\ H_1 : \beta_{\text{Education}} &\neq 0 \end{cases} \]

\(~\)

As the dataset records education as the type of degree obtained, the model includes one dummy per degree level, and the relevant information is given by the t-statistics in the summary above. In our case, the \(Pr(>|t|)\) columns show that the coefficient of every degree level is statistically different from zero, so education does have an effect on log salary.

\(~\)

8 \(\mid\) Test the null hypothesis that the effects of (ddipl1==2) and (ddipl1==3) are equal

\(~\)

We want to test the hypothesis that the effect of having a high school degree is the same as that of having a middle school degree, so we can set up the following two-sided test:

\(~\)

\[ \begin{cases} H_0 : \beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})} \\ H_1 : \beta_{(\text{Education == 2})} \neq \beta_{(\text{Education == 3})} \end{cases} \Longleftrightarrow \begin{cases} H_0 : \beta_{(\text{Education == 2})} -\beta_{(\text{Education == 3})} = 0 \\ H_1 : \beta_{(\text{Education == 2})} - \beta_{(\text{Education == 3})} \neq 0 \end{cases}\]

\(~\)

The statistic of this test corresponds to :

\(~\)

\[ T = \frac{\left(\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right) - \left(\beta_{(\text{Education == 2})}-\beta_{(\text{Education == 3})}\right)}{\hat{\sigma}\left[\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right]} \overset{H_0}{=} \frac{\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}}{\hat{\sigma}\left[\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right]} \]

\(~\)

# Extracting the coefficients of interest 

beta_Educ_2 <- model_1$coef[5]
beta_Educ_3 <- model_1$coef[6]

\(~\)

We can then compute the value of the standard error we need :

\[ \hat{\sigma}\left(\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right) = \sqrt{\widehat{\text{Var}}\left(\hat{\beta}_{(\text{Education == 2})}\right) + \widehat{\text{Var}}\left(\hat{\beta}_{(\text{Education == 3})}\right)-2\,\widehat{\text{Cov}}\left(\hat{\beta}_{(\text{Education == 2})},\hat{\beta}_{(\text{Education == 3})}\right)} \]

\(~\)

# Extracting the corresponding variances 

var_beta_Educ_2 <- vcov(model_1)[5,5]
var_beta_Educ_3 <- vcov(model_1)[6,6]

# Extracting the corresponding covariances 

covar_beta_Educ_2_Educ_3 <- vcov(model_1)[5,6]

# Computing sigma :

sigma_Educ_2_Educ_3 <- sqrt(var_beta_Educ_2 + var_beta_Educ_3 - 2*covar_beta_Educ_2_Educ_3)

\(~\)

The decision rule corresponding to this test is :

\(~\)

\[\begin{cases} |T| > t_{1-\frac{\alpha}{2}} & \text{: We reject $H_0$ at $\alpha = 0.05$} \\ |T| \leq t_{1-\frac{\alpha}{2}} & \text{: We do not reject $H_0$ at $\alpha = 0.05$} \end{cases}\]

\(~\)

# Setting the level of alpha

alpha_level <- 0.05

# Computing the t-statistic : difference of the two coefficients over its standard error

t_stat <- (beta_Educ_2 - beta_Educ_3) / sigma_Educ_2_Educ_3

t_test <- qt(1-alpha_level/2, df.residual(model_1))

# Decision rule from the student distribution

if (abs(t_stat) < t_test) {
print("Do not reject H_0 at alpha = 0.05")
} else {
print("Reject H_0 at alpha = 0.05")
}
## [1] "Reject H_0 at alpha = 0.05"
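The corresponding two-sided p-value can be obtained from the Student distribution (a sketch using the t_stat computed above):

# Two-sided p-value of the test
2 * pt(-abs(t_stat), df = df.residual(model_1))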

\(~\)

9 \(\mid\) Estimate the model under the null that the effects of (ddipl1==2) and (ddipl1==3) are equal.

\(~\)

Under the null hypothesis \(\beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})}\), we can factor a single common coefficient in front of the sum of the \(\text{Education == 2}\) and \(\text{Education == 3}\) dummies in the linear model.

\(~\)

# Transform the dataset with the new variable

data_2 <- data_1 %>% 
  mutate(Education_1 = if_else(Education == 1, 1, 0),
         Education_2 = if_else(Education == 2, 1, 0), 
         Education_3 = if_else(Education == 3, 1, 0),
         Education_4 = if_else(Education == 4, 1, 0),
         Education_5 = if_else(Education == 5, 1, 0),
         Education_6 = if_else(Education == 6, 1, 0),
         Education_3_2 = Education_3+Education_2)

# Running the model with the transformed dataset

model_2 <- lm(Log_Income ~ as.factor(Gender) + Age + 
                Age_squared + 
                as.factor(Education_3_2) +
                as.factor(Education_4) +
                as.factor(Education_5) +
                as.factor(Education_6), 
              data = data_2)

# Output

summary(model_2)
## 
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared + 
##     as.factor(Education_3_2) + as.factor(Education_4) + as.factor(Education_5) + 
##     as.factor(Education_6), data = data_2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6500 -0.1705  0.0520  0.2608  1.7847 
## 
## Coefficients:
##                             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                7.566e+00  4.752e-02  159.23   <2e-16 ***
## as.factor(Gender)2        -3.329e-01  6.707e-03  -49.64   <2e-16 ***
## Age                        5.350e-02  2.443e-03   21.90   <2e-16 ***
## Age_squared               -5.128e-04  3.031e-05  -16.92   <2e-16 ***
## as.factor(Education_3_2)1  2.395e-01  8.893e-03   26.93   <2e-16 ***
## as.factor(Education_4)1    4.150e-01  1.145e-02   36.25   <2e-16 ***
## as.factor(Education_5)1    6.024e-01  1.166e-02   51.64   <2e-16 ***
## as.factor(Education_6)1    8.226e-01  1.172e-02   70.19   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4878 on 21565 degrees of freedom
## Multiple R-squared:  0.2869, Adjusted R-squared:  0.2867 
## F-statistic:  1240 on 7 and 21565 DF,  p-value: < 2.2e-16
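
\(~\)

Since model_2 is nested in model_1 (it imposes the single restriction \(\beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})}\)), the same restriction can also be tested with an F-test comparing the two fits (a sketch, not required by the problem set):

# F-test of the single restriction by comparing the restricted and unrestricted models
anova(model_2, model_1)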

\(~\)

10 \(\mid\) Regress log(salfr) on gender (variable s), age, age squared

\(~\)

# Linear regression : log(salfr) on gender (variable s), age and age squared

model_3 <- lm(Log_Income ~ as.factor(Gender) + Age + 
                Age_squared, data = data_1)

# Output

summary(model_3)
## 
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared, 
##     data = data_1)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1712 -0.2241  0.0163  0.2929  1.8426 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         7.973e+00  5.312e-02  150.08   <2e-16 ***
## as.factor(Gender)2 -2.825e-01  7.525e-03  -37.53   <2e-16 ***
## Age                 5.392e-02  2.758e-03   19.55   <2e-16 ***
## Age_squared        -5.831e-04  3.421e-05  -17.05   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5521 on 21569 degrees of freedom
## Multiple R-squared:  0.08648,    Adjusted R-squared:  0.08635 
## F-statistic: 680.6 on 3 and 21569 DF,  p-value: < 2.2e-16

\(~\)

11 \(\mid\) Create a global variable containing SSR for that model

\(~\)

# Linear regression : log(salfr) on gender (variable s), age and age squared

model_3 <- lm(Log_Income ~ as.factor(Gender) + Age + 
                Age_squared, data = data_1)


# Computing SST, SSE and SSR for the full model and the constrained model (no education)

SRR_table_2 <- data_1 %>%
             summarise(SST = sum((Log_Income-mean(Log_Income))^2), 
                       SSE = sum((fitted(model_1)-mean(Log_Income))^2),
                       SSE_cm = sum((fitted(model_3)-mean(Log_Income))^2),
                       SSR    = SST - SSE,
                       SSR_cm = SST - SSE_cm,
                       N = length(Log_Income),
                       'K+1' = 3+1)

SRR_table_2 %>% kbl(booktabs = T) %>%
             kable_styling(latex_options = "striped", full_width = T) %>%
             column_spec(5, color = "red")
      SST       SSE     SSE_cm        SSR     SSR_cm       N   K+1
 7196.053   2066.75   622.3263   5129.303   6573.727   21573     4

\(~\)

12 \(\mid\) Create a global variable containing the Fisher statistic for the test that education has no effect. Compare with Stata output.

\(~\)

The Fisher statistic for the test that the \(q = 5\) education coefficients are jointly zero is given by :

\[ F_{\text{stat}} = \frac{\left(\text{SSR}_{\text{cm}} - \text{SSR}\right)/q}{\text{SSR}/\left(N-(K+1)\right)} \]

where \(\text{SSR}_{\text{cm}}\) is the SSR of the constrained model (without education), \(\text{SSR}\) that of the full model, and \(K+1 = 9\) the number of parameters of the full model.

# Creating the Fisher statistic for the test

F_stat <- SRR_table_2 %>%
          summarise(F_stat = ((SSR_cm - SSR)/5) / (SSR/(N - 9)))

F_stat %>% kbl(booktabs = T) %>%
             kable_styling(latex_options = "striped", full_width = T) %>%
             column_spec(1, color = "red")
  F_stat
  1214.5

\(~\)

This F-statistic is not the one reported by summary(lm()): the latter (\(F = 1086\) on 8 and 21564 degrees of freedom) tests that all eight slope coefficients are jointly zero, whereas the statistic computed here tests only the five education dummies, and it is the value that should be compared with Stata's output for the test on education.
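The same test can be run with R's built-in nested-model comparison (a sketch; model_3 is the restricted model without education):

# Cross-check of the education F-test
anova(model_3, model_1)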

\(~\)

13 \(\mid\) Ask Stata to provide the \(95^{th}\) percentile of the Fisher distribution with the right degrees of freedom for the test. (Use command “display” and the Fisher CDF function.) What do you deduce?

\(~\)

# 95th percentile of the Fisher distribution with (5, N - (K+1)) degrees of freedom

q_F <- qf(0.95, df1 = 5, df2 = df.residual(model_1))

# Decision rule for the global significance test 

if (q_F > F_stat$F_stat) {
print("Do not reject H_0 at alpha = 0.05, thus Education has no effect on Log_Income")
} else {
print("Reject H_0 at alpha = 0.05, thus Education has an effect on Log_Income")
}
## [1] "Reject H_0 at alpha = 0.05, thus Education has an effect on Log_Income"
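Equivalently, the p-value of the education test can be computed from the Fisher distribution (a sketch using the F_stat computed above):

# p-value of the F-test that education has no effect
pf(F_stat$F_stat, df1 = 5, df2 = df.residual(model_1), lower.tail = FALSE)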

\(~\)

14 \(\mid\) Test H0: the effect of (ddipl1==5) while being a woman is the same as the effect of (ddipl1==3) while being a man. Follow these steps:

\(~\)

\(~\)

14.a \(\mid\mid\) Set up the relevant hypothesis and t-statistic

\(~\)

\(~\)

14.b \(\mid\mid\) Compute its standard error using the Delta method

\(~\)

\(~\)
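One possible sketch (an assumption about how to read the hypothesis, not the author's solution): in the additive model_1, the effect of (ddipl1==5) for a woman relative to a man with the reference education level is \(\beta_{\text{Gender 2}} + \beta_{\text{Educ 5}}\), so the null can be written \(\beta_{\text{Gender 2}} + \beta_{\text{Educ 5}} - \beta_{\text{Educ 3}} = 0\). For a linear combination the Delta-method variance \(a'\,\widehat{V}\,a\) is exact; the names t_14, a, g_hat below are hypothetical helpers introduced for illustration:

# Hypothetical sketch of 14.a-14.c : Delta-method test of a linear combination of coefficients
b <- coef(model_1)
V <- vcov(model_1)

# Gradient of g(beta) = beta_Gender2 + beta_Educ5 - beta_Educ3
a <- setNames(rep(0, length(b)), names(b))
a["as.factor(Gender)2"]    <- 1
a["as.factor(Education)5"] <- 1
a["as.factor(Education)3"] <- -1

g_hat <- sum(a * b)                       # estimated value of the combination
se_g  <- sqrt(t(a) %*% V %*% a)           # Delta-method standard error
t_14  <- g_hat / se_g                     # t-statistic
2 * pt(-abs(t_14), df.residual(model_1))  # two-sided p-value

\(~\)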

14.c \(\mid\mid\) Conclude

\(~\)