INTRODUCTION
We loaded the libraries needed for our estimations and then read in our database.
library(haven)
library(car)
library(stargazer)
library(lmtest)
SDEMT118 <- read_dta(file.choose())
We then dropped the observations that are not relevant to the analysis.
SDEMT118 <- SDEMT118[SDEMT118$eda > 18,] # keep only adults (age above 18)
SDEMT118 <- SDEMT118[SDEMT118$ingocup > 0,] # keep only incomes greater than zero
SDEMT118 <- SDEMT118[SDEMT118$anios_esc != 99,] # drop years of schooling coded 99 (information not available)
We recoded some variables as dummies to make the regression easier to interpret.
SDEMT118$mujer <- recode(SDEMT118$sex, "1=0; 2=1") # female = 1; male = 0
table(SDEMT118$mujer)
SDEMT118$casado <- recode(SDEMT118$e_con, "5=1; 1=0; 2=0; 3=0; 4=0; 6=0; 9=0") # married = 1; any other marital status = 0
table(SDEMT118$casado)
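The dummy coding above can be checked on a small made-up data frame using only base R. This is a sketch, not the code behind the estimates in this report: the data frame `toy` and its values are hypothetical, and `ifelse()` stands in for `recode()`.

```r
# Hypothetical toy data: sex (1 = male, 2 = female), e_con (5 = married)
toy <- data.frame(sex = c(1, 2, 2, 1), e_con = c(5, 6, 5, 1))

# Base-R equivalents of the recode() calls above
toy$mujer  <- ifelse(toy$sex == 2, 1, 0)    # female = 1, male = 0
toy$casado <- ifelse(toy$e_con == 5, 1, 0)  # married = 1, otherwise 0

# Cross-tabulate the two dummies to confirm every row landed in a cell
print(table(mujer = toy$mujer, casado = toy$casado))
```

Tabulating the new variables right after recoding, as done above with table(), is a quick way to catch a mapping that was typed the wrong way round.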
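We ran the regression on the variables of interest and then summarized it with stargazer to make the results easier to interpret.
reg <- lm(log(ingocup) ~ mujer + eda + I(eda^2) + anios_esc + casado + hrsocup, data = SDEMT118)
stargazer(reg, type = "text")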
=================================================
                        Dependent variable:
                    -----------------------------
                            log(ingocup)
-------------------------------------------------
mujer                         -0.283***
                               (0.004)
eda                            0.040***
                               (0.001)
I(eda2)                       -0.0004***
                              (0.00001)
anios_esc                      0.075***
                              (0.0005)
casado                         0.038***
                               (0.004)
hrsocup                        0.011***
                              (0.0001)
Constant                       6.540***
                               (0.017)
-------------------------------------------------
Observations                   112,660
R2                              0.301
Adjusted R2                     0.301
Residual Std. Error      0.641 (df = 112653)
F Statistic         8,081.319*** (df = 6; 112653)
=================================================
Note:                 *p<0.1; **p<0.05; ***p<0.01
The 99% confidence intervals for the coefficients are:
confint(reg, level = 0.99)
                    0.5 %        99.5 %
(Intercept)  6.4952109637  6.5851018097
mujer       -0.2938389997 -0.2728216185
eda          0.0378879514  0.0419496415
I(eda^2)    -0.0004406711 -0.0003949861
anios_esc    0.0733600709  0.0757822642
casado       0.0276027273  0.0487996649
hrsocup      0.0109560386  0.0115236530
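As a check on these bounds, the interval for mujer can be rebuilt by hand as estimate plus or minus critical value times standard error. The sketch below assumes a two-sided 99% interval based on the t distribution; the numbers are copied from the regression output in this report rather than re-estimated.

```r
# Values copied from the regression output (not re-estimated here)
est <- -0.28333   # coefficient on mujer
se  <- 0.004080   # its OLS standard error
df  <- 112653     # residual degrees of freedom

# Two-sided 99% interval: estimate +/- t critical value * standard error
crit <- qt(0.995, df)
ci <- est + c(-1, 1) * crit * se
print(round(ci, 4))
```

With this many degrees of freedom the t critical value is practically the normal one (about 2.576), so the interval is essentially the estimate plus or minus 2.576 standard errors.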
Hypothesis testing with our model: t-test
summary(reg)
Call:
lm(formula = log(ingocup) ~ mujer + eda + I(eda^2) + anios_esc +
    casado + hrsocup, data = SDEMT118)

Residuals:
    Min      1Q  Median      3Q     Max
-5.0028 -0.3203  0.0398  0.3784  3.8213

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  6.540e+00  1.745e-02 374.824   <2e-16 ***
mujer       -2.833e-01  4.080e-03 -69.449   <2e-16 ***
eda          3.992e-02  7.884e-04  50.632   <2e-16 ***
I(eda^2)    -4.178e-04  8.868e-06 -47.117   <2e-16 ***
anios_esc    7.457e-02  4.702e-04 158.605   <2e-16 ***
casado       3.820e-02  4.115e-03   9.284   <2e-16 ***
hrsocup      1.124e-02  1.102e-04 102.014   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6415 on 112653 degrees of freedom
  (7182 observations deleted due to missingness)
Multiple R-squared: 0.3009, Adjusted R-squared: 0.3009
F-statistic: 8081 on 6 and 112653 DF, p-value: < 2.2e-16
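Every coefficient is statistically significant at the 5% level, and indeed at the 1% level: each p-value is below 2e-16, so for each variable we reject the null hypothesis that its coefficient is zero.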
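To make the t-test concrete, the statistic for casado (the smallest one in the table) can be recomputed from its estimate and standard error. This is a sketch using the rounded values printed in the summary, so the result matches the reported 9.284 only up to rounding.

```r
# Values copied from the summary output (rounded as printed)
est <- 0.03820    # coefficient on casado
se  <- 0.004115   # its standard error
df  <- 112653     # residual degrees of freedom

# t statistic for H0: coefficient = 0, and its two-sided p-value
t_stat <- est / se
p_val  <- 2 * pt(-abs(t_stat), df = df)

print(round(t_stat, 2))
print(p_val < 0.05)   # TRUE means H0 is rejected at the 5% level
```

Since even the smallest t statistic is this far from zero, the same conclusion holds for every other coefficient.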
We then plotted a histogram of the ingocup variable.
We ran a Breusch-Pagan test and a White test for heteroskedasticity and interpreted the results.
Breusch-Pagan test
bptest(reg)
studentized Breusch-Pagan test
data: reg
BP = 4151.9, df = 6, p-value < 2.2e-16
White test (special form, based on the fitted values and their squares)
bptest(reg, ~fitted(reg) + I(fitted(reg)^2))
studentized Breusch-Pagan test
data: reg
BP = 2774.6, df = 2, p-value < 2.2e-16
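Conclusion: both tests reject the null hypothesis of homoskedasticity, since the p-value (below 2.2e-16) is far smaller than 0.05. There is evidence of heteroskedasticity, so we adjust the inference using heteroskedasticity-robust standard errors.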
reg3 <- coeftest(reg, hccm)
reg3
t test of coefficients:

               Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  6.5402e+00  1.9203e-02 340.5831 < 2.2e-16 ***
mujer       -2.8333e-01  4.1152e-03 -68.8489 < 2.2e-16 ***
eda          3.9919e-02  9.0704e-04  44.0102 < 2.2e-16 ***
I(eda^2)    -4.1783e-04  1.0794e-05 -38.7079 < 2.2e-16 ***
anios_esc    7.4571e-02  5.2625e-04 141.7019 < 2.2e-16 ***
casado       3.8201e-02  4.2141e-03   9.0651 < 2.2e-16 ***
hrsocup      1.1240e-02  1.3675e-04  82.1904 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
stargazer(reg, reg3, type = "text")
=============================================================
                               Dependent variable:
                    -----------------------------------------
                            log(ingocup)
                                 OLS            coefficient
                                                   test
                                 (1)                (2)
-------------------------------------------------------------
mujer                         -0.283***          -0.283***
                               (0.004)            (0.004)
eda                            0.040***           0.040***
                               (0.001)            (0.001)
I(eda2)                       -0.0004***         -0.0004***
                              (0.00001)          (0.00001)
anios_esc                      0.075***           0.075***
                              (0.0005)            (0.001)
casado                         0.038***           0.038***
                               (0.004)            (0.004)
hrsocup                        0.011***           0.011***
                              (0.0001)           (0.0001)
Constant                       6.540***           6.540***
                               (0.017)            (0.019)
-------------------------------------------------------------
Observations                   112,660
R2                              0.301
Adjusted R2                     0.301
Residual Std. Error      0.641 (df = 112653)
F Statistic         8,081.319*** (df = 6; 112653)
=============================================================
Note:                             *p<0.1; **p<0.05; ***p<0.01
We created three dummy variables (marrfem, singmal and singfem) and ran a new regression to see the effect of being a married woman, a single man or a single woman; married men are the omitted reference group.
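The report does not show how these dummies were built. One possible construction, sketched here on a hypothetical data frame `toy`, multiplies the existing mujer and casado dummies. It assumes "single" means casado = 0, which may differ from the definition actually used.

```r
# Hypothetical toy data with the two existing dummies
toy <- data.frame(mujer = c(0, 0, 1, 1), casado = c(1, 0, 1, 0))

# Three group dummies; married men (mujer = 0, casado = 1) are the base group
toy$marrfem <- toy$mujer * toy$casado              # married women
toy$singmal <- (1 - toy$mujer) * (1 - toy$casado)  # single men
toy$singfem <- toy$mujer * (1 - toy$casado)        # single women

# Each row belongs to at most one group; the base group has all three at 0
print(rowSums(toy[, c("marrfem", "singmal", "singfem")]))
```

Under the same assumption, the regression would take a form such as lm(log(ingocup) ~ marrfem + singmal + singfem + eda + I(eda^2) + anios_esc + hrsocup, data = SDEMT118), with each group coefficient read as a difference relative to married men.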
stargazer(reg2, type = "text")
=================================================
                        Dependent variable:
                    -----------------------------
                            log(ingocup)
-------------------------------------------------
marrfem                       -0.352***
                               (0.006)
singmal                       -0.085***
                               (0.005)
singfem                       -0.319***
                               (0.005)
eda                            0.040***
                               (0.001)
I(eda2)                       -0.0004***
                              (0.00001)
anios_esc                      0.074***
                              (0.0005)
hrsocup                        0.011***
                              (0.0001)
Constant                       6.617***
                               (0.019)
-------------------------------------------------
Observations                   112,660
R2                              0.302
Adjusted R2                     0.302
Residual Std. Error      0.641 (df = 112652)
F Statistic         6,971.641*** (df = 7; 112652)
=================================================
Note:                 *p<0.1; **p<0.05; ***p<0.01
CONCLUSION
According to economic theory, there are no redundant variables, because we have specified our model correctly. Nor is there perfect collinearity among the independent variables, since lm() was able to estimate every coefficient.
We wondered whether other variables might also influence monthly income, so we took the formal/informal employment variable from the ENOE data into account (job), again coded as a dummy variable. According to this regression, simply having a formal job is associated with earnings about 0.40 log points higher than those of someone without one, holding the other variables constant. That is roughly 40% as a first approximation, or about 49% when computed exactly as exp(0.400) - 1.
SDEMT118$job <- recode(SDEMT118$emp_ppal, "1=0; 2=1") # formal job = 1; informal job = 0
reg2 <- lm(log(ingocup) ~ job + mujer + eda + I(eda^2) + anios_esc + casado + hrsocup, data=SDEMT118)
stargazer(reg2, type = "text")
=================================================
                        Dependent variable:
                    -----------------------------
                            log(ingocup)
-------------------------------------------------
job                            0.400***
                               (0.004)
mujer                         -0.280***
                               (0.004)
eda                            0.037***
                               (0.001)
I(eda2)                       -0.0004***
                              (0.00001)
anios_esc                      0.057***
                              (0.0005)
casado                         0.019***
                               (0.004)
hrsocup                        0.009***
                              (0.0001)
Constant                       6.678***
                               (0.017)
-------------------------------------------------
Observations                   112,660
R2                              0.357
Adjusted R2                     0.357
Residual Std. Error      0.615 (df = 112652)
F Statistic         8,939.598*** (df = 7; 112652)
=================================================
Note:                 *p<0.1; **p<0.05; ***p<0.01
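Because the dependent variable is log(ingocup), a dummy coefficient b corresponds to an exact percentage change of 100 * (exp(b) - 1), which differs noticeably from 100 * b when b is large. The sketch below applies that conversion to the job and mujer coefficients copied from the table above; it is an interpretation aid, not part of the original estimation.

```r
# Coefficients copied from the last regression table
b <- c(job = 0.400, mujer = -0.280)

# Exact percentage effect of a dummy in a log-linear model
pct_exact  <- 100 * (exp(b) - 1)
# First-order approximation, read directly off the coefficient
pct_approx <- 100 * b

print(round(rbind(pct_approx, pct_exact), 1))
```

The approximation is close for small coefficients such as casado (0.019) but understates the job premium, which is why the exact figure is worth reporting alongside it.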