Estimaremos modelos de regresión lineal múltiple utilizando la misma base de datos de Wooldridge utilizada como ejemplo en el capítulo anterior, wage1, que contiene datos de la Encuesta de población de 1976.
#2.- BASE DE DATOS
library(wooldridge)
## Warning: package 'wooldridge' was built under R version 4.2.3
base <- wage1
names(base)
## [1] "wage" "educ" "exper" "tenure" "nonwhite" "female"
## [7] "married" "numdep" "smsa" "northcen" "south" "west"
## [13] "construc" "ndurman" "trcommpu" "trade" "services" "profserv"
## [19] "profocc" "clerocc" "servocc" "lwage" "expersq" "tenursq"
Las variables que utilizaremos en este capítulo son las siguientes:
wage: salario promedio por hora educ: años de educación exper: años de experiencia potencial ternure: años con el empleador actual (antigüedad) nonwhite: es igual a 1 si la persona no es blanca female: es igual a 1 si la persona es mujer married: es igual a 1 si la persona es casada lwage: logaritmo del salario, log(wage) expersq: el cuadrado de los años de experiencia potencial, exper2 tenursq: el cuadrado de la antigüedad en el trabajo actual, ternure2
names(base)[4] <- "antigüedad"
Seleccionamos las variables que nos interesan por la posición de las columnas en el data frame. Indicamos que queremos las variables que están desde la posición 1 hasta la 7 y las que están en la posición 22, 23 y 24. Las asignamos al objeto datos.
datos <- base[,c(1:7,22,23,24)]
dim(datos)
## [1] 526 10
Podríamos solicitar el número de filas (personas en este caso) y de columnas (variables) por separado con las siguientes funciones:
nrow(datos)
## [1] 526
ncol(datos)
## [1] 10
Miramos las observaciones de los datos.
head(datos, n=5)
## wage educ exper antigüedad nonwhite female married lwage expersq tenursq
## 1 3.10 11 2 0 0 1 0 1.131402 4 0
## 2 3.24 12 22 2 0 1 1 1.175573 484 4
## 3 3.00 11 2 0 0 0 0 1.098612 4 0
## 4 6.00 8 44 28 0 0 1 1.791759 1936 784
## 5 5.30 12 7 2 0 0 1 1.667707 49 4
Solicitamos los descriptivos con summary:
summary(datos)
## wage educ exper antigüedad
## Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000
## 1st Qu.: 3.330 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
## Median : 4.650 Median :12.00 Median :13.50 Median : 2.000
## Mean : 5.896 Mean :12.56 Mean :17.02 Mean : 5.105
## 3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:26.00 3rd Qu.: 7.000
## Max. :24.980 Max. :18.00 Max. :51.00 Max. :44.000
## nonwhite female married lwage
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Min. :-0.6349
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.: 1.2030
## Median :0.0000 Median :0.0000 Median :1.0000 Median : 1.5369
## Mean :0.1027 Mean :0.4791 Mean :0.6084 Mean : 1.6233
## 3rd Qu.:0.0000 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 1.9286
## Max. :1.0000 Max. :1.0000 Max. :1.0000 Max. : 3.2181
## expersq tenursq
## Min. : 1.0 Min. : 0.00
## 1st Qu.: 25.0 1st Qu.: 0.00
## Median : 182.5 Median : 4.00
## Mean : 473.4 Mean : 78.15
## 3rd Qu.: 676.0 3rd Qu.: 49.00
## Max. :2601.0 Max. :1936.00
#3.- CORRELACIONES
library(GGally)
## Warning: package 'GGally' was built under R version 4.2.3
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 4.2.3
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(ggplot2)
ggpairs(datos)
round(cor(datos, method = "pearson"), 3)
## wage educ exper antigüedad nonwhite female married lwage
## wage 1.000 0.406 0.113 0.347 -0.039 -0.340 0.229 0.937
## educ 0.406 1.000 -0.300 -0.056 -0.085 -0.085 0.069 0.431
## exper 0.113 -0.300 1.000 0.499 0.014 -0.042 0.317 0.111
## antigüedad 0.347 -0.056 0.499 1.000 0.012 -0.198 0.240 0.326
## nonwhite -0.039 -0.085 0.014 0.012 1.000 -0.011 -0.062 -0.039
## female -0.340 -0.085 -0.042 -0.198 -0.011 1.000 -0.166 -0.374
## married 0.229 0.069 0.317 0.240 -0.062 -0.166 1.000 0.271
## lwage 0.937 0.431 0.111 0.326 -0.039 -0.374 0.271 1.000
## expersq 0.030 -0.331 0.961 0.459 0.009 -0.028 0.217 0.023
## tenursq 0.267 -0.069 0.423 0.922 -0.007 -0.176 0.167 0.236
## expersq tenursq
## wage 0.030 0.267
## educ -0.331 -0.069
## exper 0.961 0.423
## antigüedad 0.459 0.922
## nonwhite 0.009 -0.007
## female -0.028 -0.176
## married 0.217 0.167
## lwage 0.023 0.236
## expersq 1.000 0.414
## tenursq 0.414 1.000
Con la función attach(datos) logramos que el R reconozca el nombre de las variables del objeto datos sin necesidad de citar el nombre del data frame y luego el signo de $. A partir de este punto, realizaremos los diferentes comandos teniendo en cuenta que se aplicó esta función.
attach(datos)
library(ggplot2)
library(plotly)
## Warning: package 'plotly' was built under R version 4.2.3
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(x = educ, y = antigüedad, z = wage, type = "scatter3d", color = wage) %>%
layout(scene = list(xaxis = list(title = 'educación (en años)'),
yaxis = list(title = 'antigüedad (en años)'),
zaxis = list(title = 'Salario (en USD/h)')))
## No scatter3d mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
library(scatterplot3d)
## Warning: package 'scatterplot3d' was built under R version 4.2.3
graf <- scatterplot3d(x = educ, y = antigüedad, z = wage, pch = 16,
cex.lab = 1, highlight.3d = TRUE, type = "h",
xlab = 'Años de educación',
ylab = 'Antigüedad (años)',
zlab = 'Salario (USD/h)')
#4.- Estimar un modelo para el salario
mod1 <- lm(wage ~ educ + antigüedad, data=datos)
summary(mod1)
##
## Call:
## lm(formula = wage ~ educ + antigüedad, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.1438 -1.7288 -0.6372 1.2575 14.7482
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.22162 0.64015 -3.47 0.000563 ***
## educ 0.56914 0.04881 11.66 < 2e-16 ***
## antigüedad 0.18958 0.01871 10.13 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.092 on 523 degrees of freedom
## Multiple R-squared: 0.3019, Adjusted R-squared: 0.2992
## F-statistic: 113.1 on 2 and 523 DF, p-value: < 2.2e-16
Constante (-2.22) se estima que el salario independiente de la educación y la antigüedad del individuo ascendería a -2.22 unidades. Esto no tiene pues el mínimo salario cero (pero hay que ser conscientes que tampoco existen personas sin estudios ni antigüedad).
Educ (0.5691) se estima que un año adicional de educación del individuo, con el resto de factores constantes, provocaría un incremento del salario de 0.5691 um. El coeficiente es estadísticamente significativo al 5% (P valor = 0.000 < 0.05)
Antigüedad (0.18958) se estima que un año adicional de antigüedad del individuo, con el resto de factores constantes, provocaría un incremento del salario de 0.1895 um. El coeficiente es estadísticamente signficativo al 5% (P valor = 0.000 < 0.05)
El coeficiente de determinación (R2 = 0.30). La variación de la educación y la antigüedad permite explicar el 30% de las variaciones en el ssalario. (La fiabilidad es pobre, porque R2 < 50%).
anova(mod1)
## Analysis of Variance Table
##
## Response: wage
## Df Sum Sq Mean Sq F value Pr(>F)
## educ 1 1179.7 1179.73 123.43 < 2.2e-16 ***
## antigüedad 1 981.7 981.72 102.71 < 2.2e-16 ***
## Residuals 523 4999.0 9.56
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
El contraste de significación conjunta tiene una F de 113.1, con un p valor de 0.000 < 0.05. Tenemos evidencia empírica suficiente para rechazar H0 al nivel del 5%. Por tanto, las pendientes del modelo son conjuntamente significativas.
Representación gráfica
library(scatterplot3d)
graf <- scatterplot3d(x = educ, y = antigüedad, z = wage, pch = 16,
cex.lab = 1, highlight.3d = TRUE, type = "h",
xlab = 'Años de educación',
ylab = 'Experiencia (años)',
zlab = 'Salario (USD/h)')
graf$plane3d(mod1, lty.box = "solid", col = 'blue')
SCR <- deviance(mod1) # Suma del cuadrado de los residuos
SCR
## [1] 4998.965
gl <- df.residual(mod1) # Grados de libertadl del modelo
gl
## [1] 523
sigma2_gorro <- (SCR/gl) #Sigma gorro cuadrado
sigma2_gorro
## [1] 9.55825
(sigma(mod1))^2
## [1] 9.55825
library(e1071)
## Warning: package 'e1071' was built under R version 4.2.3
plot(density(mod1$residuals),
main = "Density plot: residuals",
ylab = "Frequency",
sub = paste("Skewness:",round(e1071::skewness(mod1$residuals), 3)))
polygon(density(mod1$residuals), col = "lightblue")
Solicitamos los gráficos mostrados a continuación para inspeccionar la
normalidad de los errores a través de la distribución empírica de los
residuos (que es lo que observamos).
Los mismos permiten comparar los cuantiles de los residuos con los cuantiles de una distribución normal. Si el proceso generador de datos fuera normal, los puntos estarían cerca de la recta.
qqnorm(mod1$residuals)
qqline(mod1$residuals)
shapiro.test(mod1$residuals)
##
## Shapiro-Wilk normality test
##
## data: mod1$residuals
## W = 0.88987, p-value < 2.2e-16
Se rechaza la hipótesis de normalidad.
El resultado del p-valor indica que debemos rechazar la hipótesis nula de distribución normal de los errores. Esto confirma lo sugerido por el análisis gráfico precedente.
Presentamos dos gráficos en los que se muestran los residuos contra cada uno de los regresores, los cuales se realizan con el siguiente código:
plot1 <- ggplot(data = datos, aes(educ, mod1$residuals)) +
geom_point() +
geom_smooth(method= "loess", color = "red") +
geom_hline(yintercept = 0) +
theme_bw()
plot2 <- ggplot(data = datos, aes(antigüedad, mod1$residuals)) +
geom_point() +
geom_smooth(method= "loess", color = "red") +
geom_hline(yintercept = 0) +
theme_bw()
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 4.2.3
grid.arrange(plot1, plot2)
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
El análisis gráfico indica problemas de heterocedasticidad y de correlación entre los residuos y e nivel de los regresores.
Una manera resumida de verlo es graficando los residuos contra los valores ajustados y, que es una combinación lineal de los regresores que estamos considerando.
ggplot(data = datos, aes(mod1$fitted.values, mod1$residuals))+
geom_point(size = 1, color = "blue", alpha = 0.5)+
geom_smooth(method= "loess", color = "red") +
geom_hline(yintercept = 0) +
theme_bw()
## `geom_smooth()` using formula = 'y ~ x'
#6.- Relación del género y los salarios
library(dplyr)
## Warning: package 'dplyr' was built under R version 4.2.3
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:gridExtra':
##
## combine
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
tabla_salarios <- datos %>%
group_by(female) %>%
summarise(cantidad = n(), Mediana = median(wage), Q1 = quantile(wage, 0.25),
Q3 = quantile(wage, 0.75), Promedio = mean(wage), Desvío = sd(wage),
Mínimo = min(wage), Máximo = max(wage))
print(tabla_salarios)
## # A tibble: 2 × 9
## female cantidad Mediana Q1 Q3 Promedio Desvío Mínimo Máximo
## <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 274 6 4.14 8.77 7.10 4.16 1.5 25.0
## 2 1 252 3.75 3 5.51 4.59 2.53 0.530 21.6
datos$genero <- factor(datos$female, labels = c("hombre", "mujer"))
boxplot(wage ~ genero, data=datos)
#7.- Estimar el modelo con el efecto de género
mod2 <- lm(wage ~ educ + antigüedad + female + educ*female + antigüedad*female, data=datos)
summary(mod2)
##
## Call:
## lm(formula = wage ~ educ + antigüedad + female + educ * female +
## antigüedad * female, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2545 -1.6610 -0.5876 0.9475 14.1768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.92955 0.80848 -2.387 0.017360 *
## educ 0.60265 0.05954 10.122 < 2e-16 ***
## antigüedad 0.20422 0.02136 9.560 < 2e-16 ***
## female 0.69183 1.24736 0.555 0.579380
## educ:female -0.14895 0.09571 -1.556 0.120265
## antigüedad:female -0.13867 0.04065 -3.411 0.000697 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 520 degrees of freedom
## Multiple R-squared: 0.3741, Adjusted R-squared: 0.3681
## F-statistic: 62.17 on 5 and 520 DF, p-value: < 2.2e-16
mod3 <- lm(wage ~ educ + antigüedad + antigüedad*female, data=datos)
summary(mod3)
##
## Call:
## lm(formula = wage ~ educ + antigüedad + antigüedad * female,
## data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.147 -1.695 -0.601 1.016 13.904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.17766 0.64912 -1.814 0.070214 .
## educ 0.54501 0.04668 11.676 < 2e-16 ***
## antigüedad 0.20192 0.02134 9.462 < 2e-16 ***
## female -1.18506 0.31875 -3.718 0.000223 ***
## antigüedad:female -0.13631 0.04068 -3.351 0.000863 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.94 on 521 degrees of freedom
## Multiple R-squared: 0.3712, Adjusted R-squared: 0.3664
## F-statistic: 76.89 on 4 and 521 DF, p-value: < 2.2e-16
library(jtools)
## Warning: package 'jtools' was built under R version 4.2.3
summ(mod3, scale = TRUE, vifs = TRUE, part.corr = TRUE, confint = TRUE,
pvals = TRUE)
## MODEL INFO:
## Observations: 526
## Dependent Variable: wage
## Type: OLS linear regression
##
## MODEL FIT:
## F(4,521) = 76.89, p = 0.00
## R² = 0.37
## Adj. R² = 0.37
##
## Standard errors: OLS
## ----------------------------------------------------------------------
## Est. 2.5% 97.5% t val. p VIF
## ----------------------- ------- ------- ------- -------- ------ ------
## (Intercept) 6.70 6.35 7.05 37.13 0.00
## educ 1.51 1.26 1.76 11.68 0.00 1.01
## antigüedad 1.46 1.16 1.76 9.46 0.00 1.44
## female -1.88 -2.40 -1.36 -7.11 0.00 1.06
## antigüedad:female -0.98 -1.56 -0.41 -3.35 0.00 1.44
## ----------------------------------------------------------------------
##
## --------------------------------------------
## partial.r part.r
## ----------------------- ----------- --------
## (Intercept)
## educ 0.46 0.41
## antigüedad 0.38 0.33
## female -0.30 -0.25
## antigüedad:female -0.15 -0.12
## --------------------------------------------
##
## Continuous predictors are mean-centered and scaled by 1 s.d. The outcome variable remains in its original units.
VIFS: Como no son superiores al 10%, no tenemos problemas de multicolinealidad aproximada.
Todos lls coeficientes son estadísticamente significativos al 5%.
Modelos complementarios
library(huxtable)
## Warning: package 'huxtable' was built under R version 4.2.3
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:GGally':
##
## wrap
## The following object is masked from 'package:ggplot2':
##
## theme_grey
m0 <- lm(wage ~ educ, data = datos)
m1 <- lm(wage ~ educ + antigüedad, data = datos)
m2 <- lm(wage ~educ + antigüedad + female, data = datos)
m3 <- lm(wage ~educ + antigüedad + female + female*educ, data = datos)
m4 <- lm(wage ~educ + antigüedad + female + female*antigüedad, data = datos)
m5 <- lm(wage ~educ + antigüedad + female + female*antigüedad + female*educ, data = datos)
export_summs(m0, m1, m2, m3, m4, m5, scale = TRUE, vifs = FALSE,
part.corr = TRUE, confint = TRUE, pvals = TRUE)
## Registered S3 methods overwritten by 'broom':
## method from
## tidy.glht jtools
## tidy.summary.glht jtools
| Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | |
|---|---|---|---|---|---|---|
| (Intercept) | 5.90 *** | 5.90 *** | 6.75 *** | 6.74 *** | 6.70 *** | 6.68 *** |
| (0.15) | (0.13) | (0.18) | (0.18) | (0.18) | (0.18) | |
| educ | 1.50 *** | 1.58 *** | 1.49 *** | 1.64 *** | 1.51 *** | 1.67 *** |
| (0.15) | (0.14) | (0.13) | (0.17) | (0.13) | (0.16) | |
| antigüedad | 1.37 *** | 1.19 *** | 1.20 *** | 1.46 *** | 1.48 *** | |
| (0.14) | (0.13) | (0.13) | (0.15) | (0.15) | ||
| female | -1.79 *** | -1.79 *** | -1.88 *** | -1.89 *** | ||
| (0.27) | (0.27) | (0.26) | (0.26) | |||
| educ:female | -0.38 | -0.41 | ||||
| (0.27) | (0.27) | |||||
| antigüedad:female | -0.98 *** | -1.00 *** | ||||
| (0.29) | (0.29) | |||||
| N | 526 | 526 | 526 | 526 | 526 | 526 |
| R2 | 0.16 | 0.30 | 0.36 | 0.36 | 0.37 | 0.37 |
| All continuous predictors are mean-centered and scaled by 1 standard deviation. The outcome variable is in its original units. *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||
El mejor modelo es el m5
export_summs(m0, m1, m2, m3, m4, m5, scale = FALSE, vifs = FALSE,
part.corr = TRUE, confint = TRUE, pvals = TRUE)
| Model 1 | Model 2 | Model 3 | Model 4 | Model 5 | Model 6 | |
|---|---|---|---|---|---|---|
| (Intercept) | -0.90 | -2.22 *** | -0.85 | -1.53 | -1.18 | -1.93 * |
| (0.68) | (0.64) | (0.65) | (0.81) | (0.65) | (0.81) | |
| educ | 0.54 *** | 0.57 *** | 0.54 *** | 0.59 *** | 0.55 *** | 0.60 *** |
| (0.05) | (0.05) | (0.05) | (0.06) | (0.05) | (0.06) | |
| antigüedad | 0.19 *** | 0.16 *** | 0.17 *** | 0.20 *** | 0.20 *** | |
| (0.02) | (0.02) | (0.02) | (0.02) | (0.02) | ||
| female | -1.79 *** | -0.07 | -1.19 *** | 0.69 | ||
| (0.27) | (1.24) | (0.32) | (1.25) | |||
| educ:female | -0.14 | -0.15 | ||||
| (0.10) | (0.10) | |||||
| antigüedad:female | -0.14 *** | -0.14 *** | ||||
| (0.04) | (0.04) | |||||
| N | 526 | 526 | 526 | 526 | 526 | 526 |
| R2 | 0.16 | 0.30 | 0.36 | 0.36 | 0.37 | 0.37 |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | ||||||
Trabajamos con el modelo 5
#8.- Contraste de significación conjunta R vs NR
summary(m5)
##
## Call:
## lm(formula = wage ~ educ + antigüedad + female + female * antigüedad +
## female * educ, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2545 -1.6610 -0.5876 0.9475 14.1768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.92955 0.80848 -2.387 0.017360 *
## educ 0.60265 0.05954 10.122 < 2e-16 ***
## antigüedad 0.20422 0.02136 9.560 < 2e-16 ***
## female 0.69183 1.24736 0.555 0.579380
## antigüedad:female -0.13867 0.04065 -3.411 0.000697 ***
## educ:female -0.14895 0.09571 -1.556 0.120265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 520 degrees of freedom
## Multiple R-squared: 0.3741, Adjusted R-squared: 0.3681
## F-statistic: 62.17 on 5 and 520 DF, p-value: < 2.2e-16
library(gmodels)
## Warning: package 'gmodels' was built under R version 4.2.3
library(car)
## Warning: package 'car' was built under R version 4.2.3
## Loading required package: carData
## Warning: package 'carData' was built under R version 4.2.3
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Wald <- linearHypothesis(m5, c("female=0", "antigüedad:female=0", "educ:female=0"))
Wald
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 523 | 5e+03 | ||||
| 520 | 4.48e+03 | 3 | 517 | 20 | 2.74e-12 |
El p valor es cero. Tenemos evidencia empírica suficiente para rechazar H0 al 5%.
El efecto del género es conjuntamente significativo para explicar el salario al 5%.
Wald2 <- linearHypothesis(m5, c("antigüedad:female=0", "educ:female=0"))
Wald2
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 522 | 4.6e+03 | ||||
| 520 | 4.48e+03 | 2 | 118 | 6.84 | 0.00117 |
El p valor es cero. Se rechaza la hipótesis conjunta de que las interacciones del género con la educación y la antigüedad no son significativas conjuntamente.
Es decir, sí son conjuntamente significativas.
Wald3 <- linearHypothesis(m5, c("female=0", "educ:female=0"))
Wald3
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 522 | 4.62e+03 | ||||
| 520 | 4.48e+03 | 2 | 140 | 8.14 | 0.00033 |
El p valor es cero. Se rechaza la hipótesis conjunta. Se estima que el género y la interacción del género con la educación sí son estadísticamente significativos.
summary(m5)
##
## Call:
## lm(formula = wage ~ educ + antigüedad + female + female * antigüedad +
## female * educ, data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.2545 -1.6610 -0.5876 0.9475 14.1768
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.92955 0.80848 -2.387 0.017360 *
## educ 0.60265 0.05954 10.122 < 2e-16 ***
## antigüedad 0.20422 0.02136 9.560 < 2e-16 ***
## female 0.69183 1.24736 0.555 0.579380
## antigüedad:female -0.13867 0.04065 -3.411 0.000697 ***
## educ:female -0.14895 0.09571 -1.556 0.120265
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.936 on 520 degrees of freedom
## Multiple R-squared: 0.3741, Adjusted R-squared: 0.3681
## F-statistic: 62.17 on 5 and 520 DF, p-value: < 2.2e-16
Existe contradicción entre los test T y el de significación conjunta (MULTICOLINEALIDAD).
vif(m5)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## educ antigüedad female antigüedad:female
## 1.655602 1.450817 23.698604 1.710437
## educ:female
## 22.800583
Un modelo con variables ficticias o variables al cuadrado son normalmente problemáticos en términos de multicolinealidad aproximada.
anova(m5)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|
| 1 | 1.18e+03 | 1.18e+03 | 137 | 3.12e-28 |
| 1 | 982 | 982 | 114 | 3.52e-24 |
| 1 | 400 | 400 | 46.4 | 2.73e-11 |
| 1 | 97 | 97 | 11.3 | 0.00085 |
| 1 | 20.9 | 20.9 | 2.42 | 0.12 |
| 520 | 4.48e+03 | 8.62 |
Wald4 <- linearHypothesis(m5, c("educ:female=0"))
Wald4
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 521 | 4.5e+03 | ||||
| 520 | 4.48e+03 | 1 | 20.9 | 2.42 | 0.12 |
Sí se debería eliminar la interacción entre la educación y el género.
m6 <- lm(wage ~educ + antigüedad + female + female*antigüedad, data = datos)
summary(m6)
##
## Call:
## lm(formula = wage ~ educ + antigüedad + female + female * antigüedad,
## data = datos)
##
## Residuals:
## Min 1Q Median 3Q Max
## -8.147 -1.695 -0.601 1.016 13.904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.17766 0.64912 -1.814 0.070214 .
## educ 0.54501 0.04668 11.676 < 2e-16 ***
## antigüedad 0.20192 0.02134 9.462 < 2e-16 ***
## female -1.18506 0.31875 -3.718 0.000223 ***
## antigüedad:female -0.13631 0.04068 -3.351 0.000863 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.94 on 521 degrees of freedom
## Multiple R-squared: 0.3712, Adjusted R-squared: 0.3664
## F-statistic: 76.89 on 4 and 521 DF, p-value: < 2.2e-16
vif(m6)
## there are higher-order terms (interactions) in this model
## consider setting type = 'predictor'; see ?vif
## educ antigüedad female antigüedad:female
## 1.014985 1.443902 1.543292 1.708055
anova(m6)
| Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|
| 1 | 1.18e+03 | 1.18e+03 | 137 | 3.58e-28 |
| 1 | 982 | 982 | 114 | 3.96e-24 |
| 1 | 400 | 400 | 46.2 | 2.89e-11 |
| 1 | 97 | 97 | 11.2 | 0.000863 |
| 521 | 4.5e+03 | 8.64 |
Residuos
library(e1071) # librería necesaria para utilizar la función skewness()
plot(density(m6$residuals),
main = "Density Plot: residuos", ylab = "Frequency",
sub = paste("Skewness:", round(e1071::skewness(m6$residuals), 2)))
polygon(density(m6$residuals), col = "red")
qqnorm(m6$residuals)
qqline(m6$residuals)
shapiro.test(m6$residuals)
##
## Shapiro-Wilk normality test
##
## data: m6$residuals
## W = 0.88216, p-value < 2.2e-16
No hay normalidad.
ggplot(data = datos, aes(m6$fitted.values, m6$residuals))+
geom_point(size = 1, color = "blue", alpha = 0.5)+
geom_smooth(size = 1, color = "red", method = "loess", alphs = 0.8)+
geom_hline(yintercept = 0) +
theme_bw()
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## Warning in geom_smooth(size = 1, color = "red", method = "loess", alphs = 0.8):
## Ignoring unknown parameters: `alphs`
## `geom_smooth()` using formula = 'y ~ x'
Tenemos un problema de heterocedasticidad.
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.2.3
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.2.3
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(m6)
##
## studentized Breusch-Pagan test
##
## data: m6
## BP = 42.669, df = 4, p-value = 1.212e-08
El p valor es cero, siendo inferior a 0.05. Tenemos un problema de heterocedasticidad.
library(sandwich)
## Warning: package 'sandwich' was built under R version 4.2.3
m7 <- coeftest(m6, vocv = vcovHC (m6, type = "HC3"))
summary(m7)
## Estimate Std. Error t value Pr(>|t|)
## Min. :-1.1851 Min. :0.02134 Min. :-3.718 Min. :0.0000000
## 1st Qu.:-1.1777 1st Qu.:0.04068 1st Qu.:-3.351 1st Qu.:0.0000000
## Median :-0.1363 Median :0.04668 Median :-1.814 Median :0.0002227
## Mean :-0.3504 Mean :0.21531 Mean : 2.451 Mean :0.0142601
## 3rd Qu.: 0.2019 3rd Qu.:0.31875 3rd Qu.: 9.462 3rd Qu.:0.0008633
## Max. : 0.5450 Max. :0.64912 Max. :11.676 Max. :0.0702144
library(huxtable)
library(jtools)
export_summs(m5,m6, m7) # modelo 1 es m5, modelo 2 es m6, modelo 3 es m7
## Original model not retained as part of coeftest object. For additional model summary information (r.squared, df, etc.), consider passing `glance.coeftest()` an object where the underlying model has been saved, i.e.`lmtest::coeftest(..., save = TRUE)`.
## This message is displayed once per session.
| Model 1 | Model 2 | Model 3 | |
|---|---|---|---|
| (Intercept) | -1.93 * | -1.18 | -1.18 |
| (0.81) | (0.65) | (0.65) | |
| educ | 0.60 *** | 0.55 *** | 0.55 *** |
| (0.06) | (0.05) | (0.05) | |
| antigüedad | 0.20 *** | 0.20 *** | 0.20 *** |
| (0.02) | (0.02) | (0.02) | |
| female | 0.69 | -1.19 *** | -1.19 *** |
| (1.25) | (0.32) | (0.32) | |
| antigüedad:female | -0.14 *** | -0.14 *** | -0.14 *** |
| (0.04) | (0.04) | (0.04) | |
| educ:female | -0.15 | ||
| (0.10) | |||
| N | 526 | 526 | 526 |
| R2 | 0.37 | 0.37 | |
| *** p < 0.001; ** p < 0.01; * p < 0.05. | |||
#9.- INFORME DE LA SESIÓN
sesion_info <- devtools::session_info()
dplyr::select(
tibble::as_tibble(sesion_info$packages),
c(package, loadedversion, source)
)
| package | loadedversion | source |
|---|---|---|
| abind | 1.4-5 | CRAN (R 4.2.0) |
| assertthat | 0.2.1 | CRAN (R 4.2.3) |
| backports | 1.4.1 | CRAN (R 4.2.0) |
| broom | 1.0.5 | CRAN (R 4.2.3) |
| broom.mixed | 0.2.9.4 | CRAN (R 4.2.3) |
| bslib | 0.4.2 | CRAN (R 4.2.2) |
| cachem | 1.0.6 | CRAN (R 4.2.2) |
| callr | 3.7.3 | CRAN (R 4.2.3) |
| car | 3.1-2 | CRAN (R 4.2.3) |
| carData | 3.0-5 | CRAN (R 4.2.3) |
| class | 7.3-20 | CRAN (R 4.2.2) |
| cli | 3.6.0 | CRAN (R 4.2.2) |
| codetools | 0.2-18 | CRAN (R 4.2.2) |
| colorspace | 2.1-0 | CRAN (R 4.2.3) |
| commonmark | 1.8.1 | CRAN (R 4.2.2) |
| crayon | 1.5.2 | CRAN (R 4.2.3) |
| crosstalk | 1.2.0 | CRAN (R 4.2.3) |
| data.table | 1.14.8 | CRAN (R 4.2.3) |
| devtools | 2.4.5 | CRAN (R 4.2.3) |
| digest | 0.6.31 | CRAN (R 4.2.2) |
| dplyr | 1.1.2 | CRAN (R 4.2.3) |
| e1071 | 1.7-13 | CRAN (R 4.2.3) |
| ellipsis | 0.3.2 | CRAN (R 4.2.2) |
| evaluate | 0.20 | CRAN (R 4.2.2) |
| fansi | 1.0.4 | CRAN (R 4.2.3) |
| farver | 2.1.1 | CRAN (R 4.2.3) |
| fastmap | 1.1.0 | CRAN (R 4.2.2) |
| forcats | 1.0.0 | CRAN (R 4.2.3) |
| fs | 1.6.1 | CRAN (R 4.2.2) |
| furrr | 0.3.1 | CRAN (R 4.2.3) |
| future | 1.33.0 | CRAN (R 4.2.3) |
| gdata | 2.19.0 | CRAN (R 4.2.3) |
| generics | 0.1.3 | CRAN (R 4.2.3) |
| GGally | 2.1.2 | CRAN (R 4.2.3) |
| ggplot2 | 3.4.2 | CRAN (R 4.2.3) |
| globals | 0.16.2 | CRAN (R 4.2.2) |
| glue | 1.6.2 | CRAN (R 4.2.2) |
| gmodels | 2.18.1.1 | CRAN (R 4.2.3) |
| gridExtra | 2.3 | CRAN (R 4.2.3) |
| gtable | 0.3.3 | CRAN (R 4.2.3) |
| gtools | 3.9.4 | CRAN (R 4.2.3) |
| highr | 0.10 | CRAN (R 4.2.2) |
| htmltools | 0.5.4 | CRAN (R 4.2.2) |
| htmlwidgets | 1.6.2 | CRAN (R 4.2.3) |
| httpuv | 1.6.11 | CRAN (R 4.2.3) |
| httr | 1.4.6 | CRAN (R 4.2.3) |
| huxtable | 5.5.2 | CRAN (R 4.2.3) |
| jquerylib | 0.1.4 | CRAN (R 4.2.2) |
| jsonlite | 1.8.4 | CRAN (R 4.2.2) |
| jtools | 2.2.2 | CRAN (R 4.2.3) |
| knitr | 1.42 | CRAN (R 4.2.2) |
| labeling | 0.4.2 | CRAN (R 4.2.0) |
| later | 1.3.1 | CRAN (R 4.2.3) |
| lattice | 0.20-45 | CRAN (R 4.2.2) |
| lazyeval | 0.2.2 | CRAN (R 4.2.3) |
| lifecycle | 1.0.3 | CRAN (R 4.2.2) |
| listenv | 0.9.0 | CRAN (R 4.2.3) |
| lmtest | 0.9-40 | CRAN (R 4.2.3) |
| magrittr | 2.0.3 | CRAN (R 4.2.2) |
| MASS | 7.3-60 | CRAN (R 4.2.3) |
| Matrix | 1.6-0 | CRAN (R 4.2.3) |
| memoise | 2.0.1 | CRAN (R 4.2.2) |
| mgcv | 1.8-41 | CRAN (R 4.2.2) |
| mime | 0.12 | CRAN (R 4.2.0) |
| miniUI | 0.1.1.1 | CRAN (R 4.2.3) |
| munsell | 0.5.0 | CRAN (R 4.2.3) |
| nlme | 3.1-160 | CRAN (R 4.2.2) |
| pander | 0.6.5 | CRAN (R 4.2.3) |
| parallelly | 1.36.0 | CRAN (R 4.2.3) |
| pillar | 1.9.0 | CRAN (R 4.2.3) |
| pkgbuild | 1.4.2 | CRAN (R 4.2.3) |
| pkgconfig | 2.0.3 | CRAN (R 4.2.3) |
| pkgload | 1.3.2.1 | CRAN (R 4.2.3) |
| plotly | 4.10.2 | CRAN (R 4.2.3) |
| plyr | 1.8.8 | CRAN (R 4.2.3) |
| prettyunits | 1.1.1 | CRAN (R 4.2.3) |
| processx | 3.8.2 | CRAN (R 4.2.3) |
| profvis | 0.3.8 | CRAN (R 4.2.3) |
| promises | 1.2.0.1 | CRAN (R 4.2.3) |
| proxy | 0.4-27 | CRAN (R 4.2.3) |
| ps | 1.7.5 | CRAN (R 4.2.3) |
| purrr | 1.0.1 | CRAN (R 4.2.3) |
| R6 | 2.5.1 | CRAN (R 4.2.2) |
| RColorBrewer | 1.1-3 | CRAN (R 4.2.0) |
| Rcpp | 1.0.11 | CRAN (R 4.2.3) |
| remotes | 2.4.2.1 | CRAN (R 4.2.3) |
| reshape | 0.8.9 | CRAN (R 4.2.3) |
| rlang | 1.1.1 | CRAN (R 4.2.3) |
| rmarkdown | 2.20 | CRAN (R 4.2.2) |
| rstudioapi | 0.15.0 | CRAN (R 4.2.3) |
| sandwich | 3.0-2 | CRAN (R 4.2.3) |
| sass | 0.4.5 | CRAN (R 4.2.2) |
| scales | 1.2.1 | CRAN (R 4.2.3) |
| scatterplot3d | 0.3-44 | CRAN (R 4.2.3) |
| sessioninfo | 1.2.2 | CRAN (R 4.2.3) |
| shiny | 1.7.4.1 | CRAN (R 4.2.3) |
| stringi | 1.7.12 | CRAN (R 4.2.2) |
| stringr | 1.5.0 | CRAN (R 4.2.2) |
| tibble | 3.2.1 | CRAN (R 4.2.3) |
| tidyr | 1.3.0 | CRAN (R 4.2.3) |
| tidyselect | 1.2.0 | CRAN (R 4.2.3) |
| urlchecker | 1.0.1 | CRAN (R 4.2.3) |
| usethis | 2.2.2 | CRAN (R 4.2.3) |
| utf8 | 1.2.3 | CRAN (R 4.2.3) |
| vctrs | 0.6.3 | CRAN (R 4.2.3) |
| viridisLite | 0.4.2 | CRAN (R 4.2.3) |
| withr | 2.5.0 | CRAN (R 4.2.3) |
| wooldridge | 1.4-3 | CRAN (R 4.2.3) |
| xfun | 0.37 | CRAN (R 4.2.2) |
| xtable | 1.8-4 | CRAN (R 4.2.3) |
| yaml | 2.3.7 | CRAN (R 4.2.2) |
| zoo | 1.8-12 | CRAN (R 4.2.3) |