\(~\)
\(~\)
\(~\)
# Loading the required libraries
library(haven)
library(tidyverse)
library(tidyr)
library(stargazer)
library(xtable)
library(ggplot2)
library(kableExtra)
# Loading the dataset
data <- read_dta("/Users/bastienpatras/Desktop/Sciences Po - Master in Economics/Econometrics/Pbset 3/ee2002ext.dta")
data_1 <- data %>% na.omit() %>%
select(salfr, ddipl1, s, agd) %>%
rename(Income = salfr, Education = ddipl1, Gender = s, Age = agd) %>%
mutate(Log_Income = log(Income),
Age_squared = Age^2) %>%
filter(Log_Income > quantile(Log_Income, probs = 0.005),
Log_Income < quantile(Log_Income, probs = 0.995),
Education != 7)
head(data_1) %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T)
| Income | Education | Gender | Age | Log_Income | Age_squared |
|---|---|---|---|---|---|
| 5523 | 3 | 2 | 47 | 8.616677 | 2209 |
| 7500 | 5 | 2 | 28 | 8.922658 | 784 |
| 14000 | 3 | 1 | 56 | 9.546813 | 3136 |
| 10800 | 2 | 1 | 43 | 9.287301 | 1849 |
| 7800 | 3 | 2 | 42 | 8.961879 | 1764 |
| 8800 | 3 | 1 | 52 | 9.082507 | 2704 |
\(~\)
\(~\)
# Linear regression : log(salfr) on gender (variable s), age, age squared and ddipl1
model_1 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared + as.factor(Education), data = data_1)
# Output
summary(model_1)
##
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared +
## as.factor(Education), data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6508 -0.1705 0.0518 0.2604 1.7863
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.565e+00 4.751e-02 159.23 <2e-16 ***
## as.factor(Gender)2 -3.340e-01 6.717e-03 -49.73 <2e-16 ***
## Age 5.362e-02 2.443e-03 21.95 <2e-16 ***
## Age_squared -5.147e-04 3.031e-05 -16.98 <2e-16 ***
## as.factor(Education)2 2.700e-01 1.399e-02 19.30 <2e-16 ***
## as.factor(Education)3 2.314e-01 9.347e-03 24.76 <2e-16 ***
## as.factor(Education)4 4.149e-01 1.145e-02 36.25 <2e-16 ***
## as.factor(Education)5 6.023e-01 1.166e-02 51.64 <2e-16 ***
## as.factor(Education)6 8.225e-01 1.172e-02 70.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4877 on 21564 degrees of freedom
## Multiple R-squared: 0.2872, Adjusted R-squared: 0.2869
## F-statistic: 1086 on 8 and 21564 DF, p-value: < 2.2e-16
\(~\)
\(~\)
The way to interpret categorical variable coefficients is the following :
\(~\)
For instance, the coefficient on Gender == 2 means that, all else equal, belonging to this category decreases log_salary by 3.340e-01 and therefore multiplies income by \(e^{-0.334} \approx 0.72\), i.e. a decrease of roughly 28%. The interpretation of the coefficients in front of education is the same : for instance, having a middle school degree (coded as 2) rather than the reference category increases log_salary by 2.700e-01 and multiplies income by \(e^{0.270} \approx 1.31\) (demo in the theoretical part)\(~\)
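The implied percentage effects can be recovered directly from the estimated coefficients ; a small sketch using the coefficient names printed by lm() above :
\(~\)
# Converting log-point coefficients into implied percentage effects : exp(beta) - 1
exp(coef(model_1)["as.factor(Gender)2"]) - 1     # roughly -28% for Gender == 2
exp(coef(model_1)["as.factor(Education)2"]) - 1  # roughly +31% for Education == 2
\(~\)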
The way to interpret numerical variable coefficients is the following :
\(~\)
Because Age enters both linearly and squared, the effect of one more year of age on log_salary is not constant : it equals \(0.0536 - 2 \times 0.000515 \times \text{Age}\), so it is positive at younger ages and shrinks as age increases.\(~\)
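This marginal effect can be evaluated at any age from the two estimated coefficients ; a minimal sketch (the helper age_effect is introduced here only for illustration) :
\(~\)
# Marginal effect of one extra year of age on log income, evaluated at a given age
age_effect <- function(age) {
unname(coef(model_1)["Age"] + 2*coef(model_1)["Age_squared"]*age)
}
age_effect(40)  # about +0.012, i.e. roughly +1.2% at age 40
\(~\)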
The output from the lm function gives us an idea of the statistical significance (statistically different from zero) of each coefficient ; in our case, every variable of the model is statistically significant, which means that they are all useful in explaining log_salary.
\(~\)
The output from the lm function provides us with the \(R^2\) and, more importantly, the \(R^2_{\text{adjusted}} = 0.2869\), which corrects the \(R^2\) for the effect of adding variables to the model. In our case, the adjusted R-squared is quite low : only 28.69% of the variation of log_salary is explained by the model.
\(~\)
Finally, the output from the lm function provides us with the \(F_{\text{stat}}\) statistic and, more importantly, the result of the F-test (testing whether all slope coefficients are jointly equal to 0) on the global significance of the model, \(F_{\text{stat}} = 1086\). As the p-value is extremely small, we can safely conclude that the model is globally significant.
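Both quantities discussed above can also be extracted programmatically from the fitted object ; a small sketch :
\(~\)
# Extracting the adjusted R-squared and the overall F-statistic from the model summary
summary(model_1)$adj.r.squared
summary(model_1)$fstatistic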
\(~\)
\(~\)
As we know that :
\[ \text{SST} = \text{SSE} + \text{SSR} \Longleftrightarrow \sum_{i=1}^{n}\left(y_i-\bar{y}\right)^2 = \sum_{i=1}^{n}\left(\hat{y}_i-\bar{y}\right)^2 + \sum_{i=1}^{n}\left(\hat{y}_i-y_i\right)^2 \]
We can easily compute the SSR, as well as N, the number of observations, and K+1, the number of estimated parameters, where K corresponds to the number of independent variables that enter linearly in the model.
\(~\)
# Computing SSR, N and K+1 from data_1 dataset
SRR_table <- data_1 %>%
summarise(SST = sum((Log_Income-mean(Log_Income))^2),
SSE = sum((fitted(model_1)-mean(Log_Income))^2),
SSR = SST - SSE,
N = length(Log_Income),
'K+1' = 8+1)
# Output
SRR_table %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T) %>%
column_spec(3, color = "red")
| SST | SSE | SSR | N | K+1 |
|---|---|---|---|---|
| 7196.053 | 2066.75 | 5129.303 | 21573 | 9 |
\(~\)
\(~\)
We want to test the hypothesis that there is no effect of education. Since education enters the model through five dummy variables, this amounts to testing that all of their coefficients are jointly equal to zero :
\[ \begin{cases} H_0 : \beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})} = \dots = \beta_{(\text{Education == 6})} = 0 \\ H_1 : \text{at least one } \beta_{(\text{Education == j})} \neq 0 \end{cases} \]
\(~\)
As the model includes one dummy per type of degree obtained, the individual significance of each education level is given by the t-statistics in the summary computed above : each \(Pr(>|t|)\) indicates that the corresponding coefficient is statistically different from zero. The joint hypothesis that all education coefficients are zero is tested with the F-test further below.
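A quick way to carry out this joint test in R is a nested-model comparison : re-estimate the model without the education dummies and compare it to model_1 with anova(). This is only a sketch of the approach that is computed by hand further below (the name model_no_educ is introduced here purely for illustration) :
\(~\)
# Joint F-test that all education coefficients are zero :
# compare the model without education (restricted) to model_1 (unrestricted)
model_no_educ <- lm(Log_Income ~ as.factor(Gender) + Age + Age_squared, data = data_1)
anova(model_no_educ, model_1)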
\(~\)
\(~\)
We want to test the hypothesis that the effect of having a high school degree is the same as the effect of having a middle school degree, therefore we can set up the following bilateral test :
\(~\)
\[ \begin{cases} H_0 : \beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})} \\ H_1 : \beta_{(\text{Education == 2})} \neq \beta_{(\text{Education == 3})} \end{cases} \Longleftrightarrow \begin{cases} H_0 : \beta_{(\text{Education == 2})} -\beta_{(\text{Education == 3})} = 0 \\ H_1 : \beta_{(\text{Education == 2})} - \beta_{(\text{Education == 3})} \neq 0 \end{cases}\]
\(~\)
The statistic of this test corresponds to :
\(~\)
\[ T = \frac{\left(\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right) - \left(\beta_{(\text{Education == 2})}-\beta_{(\text{Education == 3})}\right)}{\hat{\sigma}\left[\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right]} \overset{H_0}{=} \frac{\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}}{\hat{\sigma}\left[\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}\right]} \]
\(~\)
# Extracting the coefficients of interest
beta_Educ_2 <- model_1$coef[5]
beta_Educ_3 <- model_1$coef[6]
\(~\)
We can then compute the value of the standard error we need :
\[ \sigma(\hat{\beta}_{(\text{Education == 2})}-\hat{\beta}_{(\text{Education == 3})}) = \sqrt{\text{Var}(\hat{\beta}_{(\text{Education == 2})}) + \text{Var}(\hat{\beta}_{(\text{Education == 3})})-2\times \text{Cov}(\hat{\beta}_{(\text{Education == 2})},\hat{\beta}_{(\text{Education == 3})})} \]
\(~\)
# Extracting the corresponding variances
var_beta_Educ_2 <- vcov(model_1)[5,5]
var_beta_Educ_3 <- vcov(model_1)[6,6]
# Extracting the corresponding covariances
covar_beta_Educ_2_Educ_3 <- vcov(model_1)[5,6]
# Computing sigma, the standard error of the difference of the two coefficients :
sigma_Educ_2_Educ_3 <- sqrt(var_beta_Educ_2 + var_beta_Educ_3 - 2*covar_beta_Educ_2_Educ_3)
\(~\)
The decision rule corresponding to this test is the following :
\(~\)
\[\begin{cases} |T| > t_{1-\frac{\alpha}{2}} & \text{: We reject $H_0$ at $\alpha = 0.05$} \\ |T| \leq t_{1-\frac{\alpha}{2}} & \text{: We do not reject $H_0$ at $\alpha = 0.05$} \end{cases}\]
\(~\)
# Setting the level of alpha
alpha_level <- 0.05
# Computing the t-statistic : difference of the two coefficients over its standard error
t_stat <- (beta_Educ_2 - beta_Educ_3)/sigma_Educ_2_Educ_3
# Critical value from the Student distribution
t_crit <- qt(1-alpha_level/2, df.residual(model_1))
# Decision rule (two-sided test)
if (abs(t_stat) < t_crit) {
print("Do not reject H_0 at alpha = 0.05")
} else {
print("Reject H_0 at alpha = 0.05")
}
## [1] "Do not reject H_0 at alpha = 0.05"
\(~\)
\(~\)
Under the assumption that \(\beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})}\), we can factor a single common coefficient in front of \(\text{Education == 2}\) and \(\text{Education == 3}\) in the linear model, i.e. replace the two dummies by their sum.
\(~\)
# Transform the dataset with the new variable
data_2 <- data_1 %>%
mutate(Education_1 = if_else(Education == 1, 1, 0),
Education_2 = if_else(Education == 2, 1, 0),
Education_3 = if_else(Education == 3, 1, 0),
Education_4 = if_else(Education == 4, 1, 0),
Education_5 = if_else(Education == 5, 1, 0),
Education_6 = if_else(Education == 6, 1, 0),
Education_3_2 = Education_3+Education_2)
# Running the model with the transformed dataset
model_2 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared +
as.factor(Education_3_2) +
as.factor(Education_4) +
as.factor(Education_5) +
as.factor(Education_6),
data = data_2)
# Output
summary(model_2)
##
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared +
## as.factor(Education_3_2) + as.factor(Education_4) + as.factor(Education_5) +
## as.factor(Education_6), data = data_2)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.6500 -0.1705 0.0520 0.2608 1.7847
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.566e+00 4.752e-02 159.23 <2e-16 ***
## as.factor(Gender)2 -3.329e-01 6.707e-03 -49.64 <2e-16 ***
## Age 5.350e-02 2.443e-03 21.90 <2e-16 ***
## Age_squared -5.128e-04 3.031e-05 -16.92 <2e-16 ***
## as.factor(Education_3_2)1 2.395e-01 8.893e-03 26.93 <2e-16 ***
## as.factor(Education_4)1 4.150e-01 1.145e-02 36.25 <2e-16 ***
## as.factor(Education_5)1 6.024e-01 1.166e-02 51.64 <2e-16 ***
## as.factor(Education_6)1 8.226e-01 1.172e-02 70.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4878 on 21565 degrees of freedom
## Multiple R-squared: 0.2869, Adjusted R-squared: 0.2867
## F-statistic: 1240 on 7 and 21565 DF, p-value: < 2.2e-16
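As a cross-check (a minimal sketch, not part of the estimation above), the single restriction \(\beta_{(\text{Education == 2})} = \beta_{(\text{Education == 3})}\) can also be tested by comparing the restricted and unrestricted fits with a nested-model F-test :
\(~\)
# F-test of the single restriction beta_Educ_2 = beta_Educ_3 :
# model_2 (restricted) is nested in model_1 (unrestricted), both fitted on the same observations
anova(model_2, model_1)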
\(~\)
\(~\)
# Linear regression : log(salfr) on gender (variable s), age and age squared
model_3 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared, data = data_1)
# Output
summary(model_3)
##
## Call:
## lm(formula = Log_Income ~ as.factor(Gender) + Age + Age_squared,
## data = data_1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.1712 -0.2241 0.0163 0.2929 1.8426
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.973e+00 5.312e-02 150.08 <2e-16 ***
## as.factor(Gender)2 -2.825e-01 7.525e-03 -37.53 <2e-16 ***
## Age 5.392e-02 2.758e-03 19.55 <2e-16 ***
## Age_squared -5.831e-04 3.421e-05 -17.05 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5521 on 21569 degrees of freedom
## Multiple R-squared: 0.08648, Adjusted R-squared: 0.08635
## F-statistic: 680.6 on 3 and 21569 DF, p-value: < 2.2e-16
\(~\)
\(~\)
# Constrained model (no education) : log(salfr) on gender (variable s), age and age squared
model_3 <- lm(Log_Income ~ as.factor(Gender) + Age +
Age_squared, data = data_1)
# Computing the sums of squares of the unrestricted (model_1) and constrained (model_3) models,
# together with q, the number of restrictions imposed by the constrained model :
SRR_table_2 <- data_1 %>%
summarise(SST = sum((Log_Income-mean(Log_Income))^2),
SSE = sum((fitted(model_1)-mean(Log_Income))^2),
SSE_cm = sum((fitted(model_3)-mean(Log_Income))^2),
SSR = SST - SSE,
SSR_cm = SST - SSE_cm,
N = length(Log_Income),
q = 9-4)
SRR_table_2 %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T) %>%
column_spec(5, color = "red")
| SST | SSE | SSE_cm | SSR | SSR_cm | N | q |
|---|---|---|---|---|---|---|
| 7196.053 | 2066.75 | 622.3263 | 5129.303 | 6573.727 | 21573 | 5 |
\(~\)
\(~\)
The Fisher statistic for testing the constrained model (without education) against the unrestricted one is given by :
\[ F_{\text{stat}} = \frac{\left(\text{SSR}_{\text{cm}} - \text{SSR}\right)/q}{\text{SSR}/\left(N-(K+1)\right)} \]
where \(q = 5\) is the number of restrictions (the five education coefficients set to zero) and \(K+1 = 9\) is the number of parameters of the unrestricted model.
# Creating the Fisher statistic for the test (q = 5 restrictions, N - 9 residual degrees of freedom)
F_stat <- SRR_table_2 %>%
summarise(F_stat = ((SSR_cm-SSR)/5)/(SSR/(N-9)))
F_stat %>% kbl(booktabs = T) %>%
kable_styling(latex_options = "striped", full_width = T) %>%
column_spec(1, color = "red")
| F_stat |
|---|
| 1214.5 |
\(~\)
Our computed F-statistic differs from the one reported by the lm() output : the F-statistic in summary(model_1) tests the joint significance of all the regressors (gender, age, age squared and the education dummies), whereas the statistic computed here tests only the five education coefficients, so the two statistics and their degrees of freedom are not the same.
\(~\)
\(~\)
# Computing the theoretical F critical value (q = 5, denominator df = N - 9)
q_F <- qf(0.95, df1 = 5, df2 = df.residual(model_1))
# Decision rule for the global significance test
if (q_F > F_stat$F_stat) {
print("Do not reject H_0 at alpha = 0.05, thus Education has no effect on Log_Income")
} else {
print("Reject H_0 at alpha = 0.05, thus Education has an effect on Log_Income")
}
## [1] "Reject H_0 at alpha = 0.05, thus Education has an effect on Log_Income"
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)
\(~\)