https://bookdown.org/rwnahhas/RMPH/slr-model-with-a-continuous-predictor.html
El modelo de regresión simple tiene como objetivo analizar la asociación entre una variable de respuesta o dependiente (cuantitativa, por ahora) y una variable explicativa o independiente.
Por ejemplo podríamos preguntarnos si el pib de los países tiene un efecto en la violencia.
Vamos a denotar a “y” como la variable respuesta y a “x” como la variable explicativa. Queremos saber cómo los valores de “y” (en promedio) tienden a cambiar dados los cambios en la variable “x” (en promedio).
La fórmula y= alfa+betax, nos dice que “y” es una función lineal de “x” y que vamos a tener un intercepto y una pendiente. En el contexto de la regresión, estos serán nuestros coeficientes.
Es importante que antes de efectuar el análisis de regresión, veamos un diagrama de dispersión para evaluar si existe relación lineal entre las variables.
Una vez que analizamos la posibilidad de estimar el modelo, vamos a tener nuestra ecuación estimada. La ecuación predicha o estimada, es la mejor línea o el mejor ajuste que minimza el error. La difrencia entre el valor observado y el predicho, es el residuo.
b = sumatoria(x-xbarra)(y-ybarra)/sumatoria(x-xbarra)^2 a = ybarra-bxbarra
La SSE (suma de los residuos cuadrados): sumatoria(yobservada-ypredicha)^2 La TSS(suma total): sumatoria(yobservada-ybarra)^2 La MSE (suma del modelo): TSE-SSE
Coeficiente de determinación: TSS-SSE/TSS -El coeficiente de determinación va entre 0 y 1
Cuando la regresión tiene p parámetros desconocidos, los grados de libertad serán df=n-p.
Paquetería
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1 ✔ purrr 0.3.5
## ✔ tibble 3.1.8 ✔ dplyr 1.0.10
## ✔ tidyr 1.2.1 ✔ stringr 1.4.1
## ✔ readr 2.1.2 ✔ forcats 0.5.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(janitor)
##
## Attaching package: 'janitor'
##
## The following objects are masked from 'package:stats':
##
## chisq.test, fisher.test
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
##
## The following object is masked from 'package:dplyr':
##
## group_rows
library(broom)
library(jtools)
library(dotwhisker)
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
##
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(lm.beta)
library(sjPlot)
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
##
## The following object is masked from 'package:dplyr':
##
## recode
##
## The following object is masked from 'package:purrr':
##
## some
Lectura de datos
base <- read.csv("C:/Users/molar/Dropbox/2023_Trabajos/FLACSO/Datos/life_expec_who.csv")
Vamos a limpiar los nombres
base<- clean_names(base)
Quitar notación científica
options(scipen = 999)
Filtrar la base
base_modif <- base %>%
filter(year==2015) %>%
filter(!is.na(life_expectancy)) %>%
filter(!is.na(schooling)) %>%
filter(!is.na(gdp))
El material de la base de datos, lo pueden consultar aquí: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download
Vamos a hacer un análisis de regresión para explicar la esperanza de vida a partir de la escolaridad.
Primero podemos ver cómo se distribuyen ambas variables.
base %>%
ggplot(aes(x=schooling))+
geom_histogram()+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 163 rows containing non-finite values (`stat_bin()`).
base %>%
ggplot(aes(x=life_expectancy))+
geom_histogram()+
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (`stat_bin()`).
Luego hacemos un diagrama de dispersión básico
base %>%
ggplot(aes(x=schooling, y=life_expectancy))+
geom_point()+
geom_smooth(method="lm")+
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 170 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 170 rows containing missing values (`geom_point()`).
Vemos que a lo mejor sí conviene estimar el modelo de regresión.
Podemos calcular la ecuación estimada a mano.
base$var1 <- base$schooling-mean(base$schooling, na.rm=T)
base$var2 <- base$life_expectancy-mean(base$life_expectancy, na.rm=T)
base$var3 <- base$var1*base$var2
Entonces el numerador sería
base %>%
summarise(suma=sum(var3, na.rm=T))
## suma
## 1 65181.43
Y el denominador
base$var4 <- base$var1*base$var1
base %>%
summarise(suma=sum(var4, na.rm=T))
## suma
## 1 31297.22
Entonces b es
b <- 65181.43/31297.22
Intercepto
a <- mean(base$life_expectancy, na.rm=T)-(2.08*mean(base$schooling, na.rm=T))
Forma rápida
modelo <- lm(life_expectancy~schooling, data=base_modif)
summary(modelo)
##
## Call:
## lm(formula = life_expectancy ~ schooling, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5807 -2.8686 0.3736 3.3040 10.3697
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.8409 1.7214 24.31 <0.0000000000000002 ***
## schooling 2.2930 0.1286 17.84 <0.0000000000000002 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.665 on 151 degrees of freedom
## Multiple R-squared: 0.6781, Adjusted R-squared: 0.676
## F-statistic: 318.1 on 1 and 151 DF, p-value: < 0.00000000000000022
El intervalo de confianza
confint(modelo)
## 2.5 % 97.5 %
## (Intercept) 38.43974 45.242107
## schooling 2.03897 2.546961
Tabla anova
anova <- aov(life_expectancy ~ schooling, data = base)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## schooling 1 137101 137101 3599 <0.0000000000000002 ***
## Residuals 2766 105355 38
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 170 observations deleted due to missingness
La columna DF muestra los grados de libertad para la variable independiente y los grados de libertad para los residuales.
Sum Sq muestra la variación total con respecto a la media global.
Mean Sq es la división de Sum Sq entre los grados de libertad.
F es el estadístico de preuba del modelo.
Pr(>F) es el pvalor de la prueba F.
Para hacer una tabla.
Guardamos la información del modelo.
modelo.info <- summary(modelo)
Extraemos los coeficientes
modelo.Coef <- as.data.frame(modelo.info$coefficients[, 1:2])
Calculamos los intervalos de confianza
modelo.Coef <- within(modelo.Coef, {
Lower <- Estimate - qt(p = 0.95, df = modelo.info$df[2]) * `Std. Error`
Upper <- Estimate + qt(p = 0.95, df = modelo.info$df[2]) * `Std. Error`
Variable <- row.names(modelo.Coef)
})
Vemos la tabla
modelo.Coef
## Estimate Std. Error Variable Upper Lower
## (Intercept) 41.840924 1.7214220 (Intercept) 44.689890 38.991958
## schooling 2.292966 0.1285533 schooling 2.505722 2.080209
Graficamos
ggplot(data = modelo.Coef, mapping = aes(x = Estimate, y = Variable)) +
geom_point() +
geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = .4) +
ggtitle("Coeficientes del modelo de regresión") +
theme(plot.title = element_text(hjust = 0.5))
Otra forma: https://cran.r-project.org/web/packages/jtools/vignettes/summ.html#plot_summs()_and_plot_coefs()
https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html
Datos: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download Más datos: https://vincentarelbundock.github.io/Rdatasets/articles/data.html Dataworld
Presentación de resultados en tabla para markdown: https://zief0002.github.io/book-8252/pretty-printing-tables-in-markdown.html
lm(life_expectancy ~ schooling, data = base) %>%
tidy() %>%
kable(
caption = "Modelo de regresión lineal simple",
col.names = c("Variable", "Coeficiente", "SE", "t", "p"),
digits = c(0, 2, 3, 2, 3)
)
| Variable | Coeficiente | SE | t | p |
|---|---|---|---|---|
| (Intercept) | 44.11 | 0.437 | 100.99 | 0 |
| schooling | 2.10 | 0.035 | 60.00 | 0 |
Ahora vamos a explicar la esperanza de vida en términos de la escolaridad y el gdp
base_modif <- base %>%
filter(year==2015) %>%
filter(!is.na(life_expectancy)) %>%
filter(!is.na(schooling)) %>%
filter(!is.na(gdp))
Lo vemos
modelo2 <- lm(life_expectancy~schooling+gdp, data=base_modif)
summary(modelo2)
##
## Call:
## lm(formula = life_expectancy ~ schooling + gdp, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.5386 -2.8642 0.6052 3.0684 10.8943
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.03414650 1.79643351 23.955 <0.0000000000000002 ***
## schooling 2.16048601 0.14216791 15.197 <0.0000000000000002 ***
## gdp 0.00007437 0.00003570 2.083 0.0389 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.615 on 150 degrees of freedom
## Multiple R-squared: 0.6872, Adjusted R-squared: 0.683
## F-statistic: 164.8 on 2 and 150 DF, p-value: < 0.00000000000000022
Intervalos de confianza
confint(modelo2, level=0.95) # intervalos de confianza
## 2.5 % 97.5 %
## (Intercept) 39.484563948421 46.5837290523
## schooling 1.879575683472 2.4413963440
## gdp 0.000003837348 0.0001449069
Anova
anova(modelo2) # anova
## Analysis of Variance Table
##
## Response: life_expectancy
## Df Sum Sq Mean Sq F value Pr(>F)
## schooling 1 6924.5 6924.5 325.1862 < 0.0000000000000002 ***
## gdp 1 92.4 92.4 4.3406 0.03891 *
## Residuals 150 3194.1 21.3
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Graficar coeficientes Con el paquete dotwhisker https://cran.r-project.org/web/packages/dotwhisker/vignettes/dotwhisker-vignette.html
coef_graf<- dwplot(list(modelo2), show_intercept = TRUE,
vline = geom_vline(xintercept = 0, colour = "grey60", linetype = 2)) %>%
relabel_predictors(c(schooling = "Años promedio de escolaridad", gdp = "PIB ")) +
theme_bw() +
xlab("Coeficientes estimados") +
ylab("Esperanza de vida") +
geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
ggtitle("Modelo de regresión lineal múltiple")
coef_graf
Con el paquete jtools https://cran.r-project.org/web/packages/jtools/vignettes/summ.html#plot_summs_and_plot_coefs
plot_coefs(modelo2, scale = TRUE)
## Loading required namespace: broom.mixed
Todo cambio en x, representa un cambio en y (manteniendo constante lo demás) La linealidad la podemos ver comparando los residuales con valores de y
modelo.metrics <- augment(modelo2)
modelo.metrics %>%
ggplot(aes(.fitted, .resid)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'
Lo ideal es que la línea recta esté aproximada a cero
(En el resultado de la regresión también podemos hablar de la linealidad)
Los residuos siguen una distribución normal.
Graficamos
modelo.metrics %>%
ggplot(aes(.resid)) +
geom_density()
Hacemos la prueba de hipótesis
shapiro.test(modelo.metrics$.resid)
##
## Shapiro-Wilk normality test
##
## data: modelo.metrics$.resid
## W = 0.97727, p-value = 0.01235
Varianza constante.
modelo.metrics %>%
ggplot(aes(.fitted, .std.resid)) +
geom_point()
bptest(modelo2)
##
## studentized Breusch-Pagan test
##
## data: modelo2
## BP = 5.6967, df = 2, p-value = 0.05794
El valor de u (factores distintos a x que afectan y), no depende del valor de x
modelo.metrics %>%
ggplot(aes(schooling, .resid)) +
geom_point()
modelo.metrics %>%
ggplot(aes(gdp, .resid)) +
geom_point()
Correlación
cor.test(modelo.metrics$.resid, modelo.metrics$schooling, method= "pearson", use= "pairwise.complete.obs")
##
## Pearson's product-moment correlation
##
## data: modelo.metrics$.resid and modelo.metrics$schooling
## t = -0.0000000000000069822, df = 151, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1586781 0.1586781
## sample estimates:
## cor
## -0.0000000000000005682042
cor.test(modelo.metrics$.resid, modelo.metrics$gdp, method= "pearson", use= "pairwise.complete.obs")
##
## Pearson's product-moment correlation
##
## data: modelo.metrics$.resid and modelo.metrics$gdp
## t = -0.0000000000000047928, df = 151, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.1586781 0.1586781
## sample estimates:
## cor
## -0.0000000000000003900353
El valor esperado de los errores tiende a 0.
mean(modelo.metrics$.resid)
## [1] 0.00000000000002800383
Prueba Durbin Watson
dwtest(modelo2)
##
## Durbin-Watson test
##
## data: modelo2
## DW = 2.1944, p-value = 0.8879
## alternative hypothesis: true autocorrelation is greater than 0
vif(modelo2)
## schooling gdp
## 1.250086 1.250086
Dado que ninguno de los valores es grande (>10) no hay multicolinealidad.
Podemos transformar nuestra variable de gdp a logaritmo.
base_modif$ln_gdp <- log(base_modif$gdp)
Veamos la variable
base_modif %>%
ggplot(aes(ln_gdp))+
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Estimamos un nuevo modelo
modelo3 <- lm(life_expectancy~schooling+ln_gdp, data=base_modif)
summary(modelo3)
##
## Call:
## lm(formula = life_expectancy ~ schooling + ln_gdp, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.9738 -2.7413 0.5747 3.3512 10.5633
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 40.0891 2.0386 19.665 <0.0000000000000002 ***
## schooling 2.1588 0.1534 14.075 <0.0000000000000002 ***
## ln_gdp 0.4473 0.2822 1.585 0.115
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.642 on 150 degrees of freedom
## Multiple R-squared: 0.6834, Adjusted R-squared: 0.6792
## F-statistic: 161.9 on 2 and 150 DF, p-value: < 0.00000000000000022
La interpretación es distinta.
(Ver página 46 de Wooldridge)
base_modif$ln_life_expectancy <- log(base_modif$life_expectancy)
modelo4 <- lm(ln_life_expectancy~schooling+gdp, data=base_modif)
summary(modelo4)
##
## Call:
## lm(formula = ln_life_expectancy ~ schooling + gdp, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.252909 -0.039535 0.008504 0.047082 0.141750
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.8532005334 0.0269468340 142.993 <0.0000000000000002 ***
## schooling 0.0311503842 0.0021325449 14.607 <0.0000000000000002 ***
## gdp 0.0000009473 0.0000005355 1.769 0.0789 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06922 on 150 degrees of freedom
## Multiple R-squared: 0.6663, Adjusted R-squared: 0.6619
## F-statistic: 149.8 on 2 and 150 DF, p-value: < 0.00000000000000022
modelo5 <- lm(ln_life_expectancy~schooling+ln_gdp, data=base_modif)
summary(modelo5)
##
## Call:
## lm(formula = ln_life_expectancy ~ schooling + ln_gdp, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.25900 -0.04063 0.00836 0.04917 0.13780
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.813260 0.030486 125.082 <0.0000000000000002 ***
## schooling 0.030943 0.002294 13.490 <0.0000000000000002 ***
## ln_gdp 0.006317 0.004220 1.497 0.137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06942 on 150 degrees of freedom
## Multiple R-squared: 0.6644, Adjusted R-squared: 0.6599
## F-statistic: 148.5 on 2 and 150 DF, p-value: < 0.00000000000000022
Seleccionamos el modelo 3 para hacer pruebas de datos atípicos
https://fhernanb.github.io/libro_regresion/diag2.html#distancia-de-cook
Es una medida de cómo influye la observación i-ésima sobre la
estimación de
β al ser retirada del conjunto de datos.
Son puntos influyentes los que presenten D=4/n-p-2
cooks.distance(modelo3)
## 1 2 3 4
## 0.0000152609620583 0.0013326131588175 0.0000609316199086 0.0464398360854721
## 5 6 7 8
## 0.0009762527936573 0.0097215380936186 0.0065259818829327 0.0335052876677183
## 9 10 11 12
## 0.0023847296763829 0.0115874043179398 0.0003295250254510 0.0283956174812847
## 13 14 15 16
## 0.0009079591788791 0.0051937729072484 0.0005541195928025 0.0000361252082947
## 17 18 19 20
## 0.0069729021194751 0.0000684607862395 0.0010140323528844 0.0053387289059752
## 21 22 23 24
## 0.0006913529835590 0.0004346606516754 0.0005934292142736 0.0000465422058382
## 25 26 27 28
## 0.0249665919632421 0.0000254393287296 0.0005926593807071 0.0134617144634662
## 29 30 31 32
## 0.0020665572627654 0.0170976281801327 0.0178299919467393 0.0078147736733491
## 33 34 35 36
## 0.0028005536882837 0.0014818508094967 0.0021421339891368 0.0063183730685041
## 37 38 39 40
## 0.0001000268799098 0.0086089180397428 0.0147206020195278 0.0002589993524672
## 41 42 43 44
## 0.0373586852844952 0.0002904666816929 0.0025507633505177 0.0001624287729098
## 45 46 47 48
## 0.0075188324429093 0.0007746180661369 0.0050712748655420 0.0080409604400490
## 49 50 51 52
## 0.0001059039160516 0.0024053443368847 0.0053475659826412 0.0000432169860648
## 53 54 55 56
## 0.0002641529937740 0.0041810822623516 0.0027597779327475 0.0045375180064077
## 57 58 59 60
## 0.0062769063585186 0.0028121144659491 0.0044574738547696 0.0000068369978452
## 61 62 63 64
## 0.0001750083802268 0.0074494528235076 0.0010550924312916 0.0031722374545900
## 65 66 67 68
## 0.0000023349573595 0.0006632079410057 0.0036903699561217 0.0040257124072187
## 69 70 71 72
## 0.0040516483180361 0.0183151825915511 0.0029864347492673 0.0137636921313313
## 73 74 75 76
## 0.0022937397940128 0.0065288150984576 0.0022437200995894 0.0009057531384611
## 77 78 79 80
## 0.0007584569458715 0.0045458373573546 0.0017448299838654 0.0568322526766040
## 81 82 83 84
## 0.0020429052990234 0.0113056205952214 0.0129325077505058 0.0020227755784420
## 85 86 87 88
## 0.0156345133556952 0.0013242416651289 0.0099592568698121 0.0033145935280630
## 89 90 91 92
## 0.0100812116552730 0.0008696869989883 0.0010463434475320 0.0027815394107485
## 93 94 95 96
## 0.0067109982318748 0.0000415882490700 0.0025362394495025 0.0074477998091452
## 97 98 99 100
## 0.0042161269404682 0.0020989489485771 0.0000056296892965 0.0022579357229904
## 101 102 103 104
## 0.0094693970600864 0.0146409390793957 0.0488757965499172 0.0314332925570057
## 105 106 107 108
## 0.0000896679114399 0.0018713417496201 0.0144529689960953 0.0078584888366887
## 109 110 111 112
## 0.0055631654272090 0.0223341792830110 0.0000238600261144 0.0012806360118160
## 113 114 115 116
## 0.0011758431414322 0.0131395991882986 0.0001362200667794 0.0004358633943257
## 117 118 119 120
## 0.0004300606945287 0.0006561329311216 0.0000008066387638 0.0038809701243981
## 121 122 123 124
## 0.0091349610892256 0.0000459434117352 0.0000574346229891 0.0415389920823922
## 125 126 127 128
## 0.0123002796727313 0.0204249584297862 0.0071551348337803 0.0120429491149783
## 129 130 131 132
## 0.0148930632419581 0.0000007315601437 0.0000915140431844 0.0183701910003005
## 133 134 135 136
## 0.0000002016906090 0.0147034257643188 0.0034180985280767 0.0047835562398229
## 137 138 139 140
## 0.0007217589040399 0.0003283389421366 0.0004775688936504 0.0160996404471349
## 141 142 143 144
## 0.0000200043782110 0.0001597222340837 0.0000000005714558 0.0001583872100959
## 145 146 147 148
## 0.0003402350301434 0.0011980704735186 0.0056631870257149 0.0023469713045343
## 149 150 151 152
## 0.0001844799911948 0.0000066061084014 0.0110878958659723 0.0086517204174718
## 153
## 0.0032591139892038
Graficamos para ver quiénes son
library(car)
influenceIndexPlot(modelo3, vars="Cook")
Otra forma
cutoff <- 4 / (153-2-2) # D=4/n-p-2
plot(modelo3, which=4, cook.levels=cutoff, las=1)
abline(h=cutoff, lty="dashed", col="dodgerblue2")
DFBETAS
Es una medida que indica cuánto cambia el coeficiente de regresión, en unidades de desviación estándar, si se omitiera la i -ésima observación.
dffits(modelo3)
## 1 2 3 4 5
## 0.00674378340 0.06312795381 0.01347605107 -0.38781614544 0.05397283897
## 6 7 8 9 10
## -0.17099258535 0.13993171031 -0.31799643937 0.08437147412 0.18617712840
## 11 12 13 14 15
## 0.03134180031 0.29357176317 -0.05204717374 -0.12497013952 0.04064892154
## 16 17 18 19 20
## -0.01037603616 -0.14501638542 -0.01428471835 0.05504256543 -0.12673783020
## 21 22 23 24 25
## -0.04541717315 0.03600711282 -0.04207709499 0.01177710358 -0.27386718830
## 26 27 28 29 30
## 0.00870717716 0.04204958603 -0.20255480452 0.07853269572 -0.22683817650
## 31 32 33 34 35
## -0.23172487590 0.15332556355 0.09156615611 0.06651550948 -0.08005065197
## 36 37 38 39 40
## 0.13792267676 0.01726630714 0.16109438370 0.21318910838 0.02778336097
## 41 42 43 44 45
## 0.33594354368 0.02942980358 0.08730973312 0.02200572141 -0.15027386273
## 46 47 48 49 50
## -0.04806153741 0.12319739083 -0.15600359209 0.01776588241 0.08474260801
## 51 52 53 54 55
## -0.12679035227 0.01134910074 -0.02805921709 -0.11215861188 0.09071835982
## 56 57 58 59 60
## -0.11668472436 0.13730968158 -0.09166577657 -0.11553153795 0.00451380095
## 61 62 63 64 65
## 0.02283904349 0.15010394622 -0.05611483842 -0.09730899127 -0.00263784237
## 66 67 68 69 70
## -0.04449844761 0.10503830347 -0.10966465533 0.11005809211 0.23450145570
## 71 72 73 74 75
## 0.09466047904 0.20365363262 0.08278979981 -0.13965539309 -0.08196220627
## 76 77 78 79 80
## -0.05201361777 0.04755442747 -0.11672775448 0.07221574062 -0.42093108162
## 81 82 83 84 85
## -0.07812083555 -0.18474068698 0.19869653851 0.07766806794 -0.21791187463
## 86 87 88 89 90
## 0.06288208237 0.17359486110 -0.09952483661 0.17418593936 0.05092671514
## 91 92 93 94 95
## -0.05588987303 0.09125251892 -0.14249305511 -0.01113297400 0.08721641085
## 96 97 98 99 100
## -0.14955423819 0.11233185284 -0.07922558834 -0.00409593108 -0.08208344645
## 101 102 103 104 105
## -0.16834286355 0.21046762113 0.38511671945 -0.31172978923 -0.01634718801
## 106 107 108 109 110
## 0.07475550616 0.20856718109 0.15373859671 0.12919764578 0.25889764891
## 111 112 113 114 115
## -0.00843247316 -0.06182390067 0.05922482491 0.19845040703 -0.02015036687
## 116 117 118 119 120
## -0.03605360515 0.03580275139 0.04425616953 -0.00155041532 -0.10779817873
## 121 122 123 124 125
## 0.16542597334 0.01170155658 -0.01308339176 -0.36073848916 0.19219102109
## 126 127 128 129 130
## 0.24866632683 0.14660093817 -0.19195412499 0.21113784167 -0.00147649938
## 131 132 133 134 135
## 0.01651605343 0.23491167376 0.00077526653 -0.21227631554 0.10119574033
## 136 137 138 139 140
## 0.11980970376 0.04641114822 0.03129228293 -0.03774668659 -0.22181790497
## 141 142 143 144 145
## -0.00772101559 -0.02181834309 0.00004126667 0.02172890392 -0.03184709905
## 146 147 148 149 150
## -0.05980137500 -0.13047266439 0.08390691322 -0.02344952512 -0.00443695355
## 151 152 153
## 0.18283812739 -0.16240143500 0.09865180792
dfbetas(modelo3)
## (Intercept) schooling ln_gdp
## 1 0.005660185140 -0.00274042992 -0.001964521560
## 2 -0.010361732846 0.01633517382 0.004713397560
## 3 -0.002765782233 0.00418252664 0.000837144858
## 4 -0.085103431156 0.24506607486 -0.193265030753
## 5 -0.022809947606 -0.01284730925 0.038796863681
## 6 0.111634350451 -0.09948501286 -0.029043805441
## 7 0.084639600543 0.05352970718 -0.112535396341
## 8 0.255922512437 -0.19555757852 -0.078139230643
## 9 -0.058634130255 -0.00097436326 0.061931834650
## 10 0.112906670168 0.09012257692 -0.175119602344
## 11 -0.016941490560 -0.00586716282 0.024007620209
## 12 0.255333992291 0.01223294440 -0.223346830269
## 13 0.030998014268 -0.00544187662 -0.029440465122
## 14 0.053164256377 -0.06429684990 -0.006768739355
## 15 -0.014946612493 0.03071479694 -0.009779582873
## 16 0.000360387630 0.00343284009 -0.004981793806
## 17 -0.112613855746 0.05239381370 0.037839909024
## 18 -0.003599410253 0.00339544599 -0.002113256173
## 19 -0.011926571206 0.01100515978 0.009600910446
## 20 0.012631725216 0.05684762273 -0.079684395513
## 21 0.022686408006 -0.01231145555 -0.015610529195
## 22 -0.008771012153 0.01714970538 -0.002285155519
## 23 0.018112691150 -0.01248554173 -0.011132359675
## 24 0.009036507682 -0.00905489848 0.000831930266
## 25 -0.213515217850 -0.07181156704 0.246384956629
## 26 0.000451307636 0.00097417310 0.000162637677
## 27 0.028369047265 -0.01898303230 -0.003224749471
## 28 -0.133567593869 0.11905604348 -0.010172803544
## 29 -0.056848177058 0.00549026578 0.053941197131
## 30 -0.192788383110 0.16140191952 0.014503477677
## 31 -0.162286380981 0.19283481989 -0.042678748178
## 32 -0.098002046581 0.06027648241 0.052162133522
## 33 0.037272643594 0.04197469001 -0.060926446243
## 34 0.028513772647 0.03407714367 -0.049809554040
## 35 -0.061588138511 0.01762026419 0.030365919652
## 36 0.033091192493 0.07980893979 -0.085733347944
## 37 -0.009638195110 0.00345073067 0.007835596995
## 38 0.057955594028 0.08969881920 -0.116570310627
## 39 0.002946598704 0.09986450545 -0.061278178027
## 40 -0.012226878355 0.02483370516 -0.009317290388
## 41 0.162986187079 -0.31340725274 0.153935895569
## 42 -0.005360922488 -0.00815525831 0.016468335255
## 43 0.032288644857 0.05117082204 -0.067109990820
## 44 -0.000766405342 -0.00308752161 0.007039491261
## 45 -0.092585782644 0.11646341023 -0.035182741255
## 46 0.005463439356 -0.04010155231 0.026916764420
## 47 0.096660256054 -0.08788228651 0.001663079369
## 48 0.054429989486 -0.07904831823 0.000396581855
## 49 -0.006569587956 0.01418019917 -0.005066950567
## 50 -0.061024895677 0.00923869812 0.055216497675
## 51 0.016753926465 0.05911783043 -0.084819260959
## 52 -0.001186441845 0.00191028331 0.001204742684
## 53 0.021474620583 -0.00664373061 -0.016072382937
## 54 -0.067908201999 0.04119115726 0.008771075829
## 55 0.016029755495 0.07463896395 -0.077845707811
## 56 0.063836197659 -0.04801193613 -0.029359957357
## 57 0.033410597220 -0.10192270110 0.076622702875
## 58 -0.076132685419 0.05755377187 0.009256433786
## 59 -0.095607623248 0.06793300935 0.014818970392
## 60 0.001112386644 -0.00355178790 0.002645084025
## 61 0.017234315033 -0.01562455709 0.000576379278
## 62 0.061994074738 -0.08824330206 0.043399366681
## 63 0.033094421993 -0.01449577035 -0.023893811372
## 64 0.044447973137 -0.08566009352 0.029683795943
## 65 -0.001410315782 0.00097309819 0.000017786799
## 66 -0.004189452543 0.00787756366 -0.010600091317
## 67 0.021289543242 -0.08492955027 0.066998432243
## 68 0.053932984324 -0.09282184280 0.025930109927
## 69 -0.076928203520 0.00569594557 0.075381069059
## 70 0.042413092427 0.18816520555 -0.194731253903
## 71 -0.004054180346 -0.03188817458 0.046567188773
## 72 -0.130232695729 -0.01804946666 0.154356870522
## 73 0.043948613203 0.03597546788 -0.064261667023
## 74 -0.050464994189 -0.09715751257 0.126058220481
## 75 -0.050977927001 0.03695148081 0.001955138176
## 76 -0.028712171480 0.01137208943 0.008058670268
## 77 -0.019709074240 -0.02131245859 0.041433212191
## 78 0.073186850829 -0.03786431347 -0.046028125352
## 79 0.032179612063 0.02982407592 -0.047491410137
## 80 -0.353886590719 -0.03099731105 0.317886122630
## 81 -0.068331695420 0.02863104236 0.028663695119
## 82 0.120443387378 -0.07791910675 -0.059694353028
## 83 0.055353297784 0.10087459385 -0.116131148905
## 84 0.063138078626 0.01575495522 -0.067808130657
## 85 -0.183651177397 0.01654409803 0.130914257492
## 86 -0.016334668896 -0.02408264555 0.044435229192
## 87 -0.030529665533 -0.07974750625 0.121328185437
## 88 -0.075295721835 0.07355563189 -0.006220012040
## 89 -0.096579059633 -0.03016671977 0.133247625867
## 90 0.032663545940 -0.04128668961 0.011848636155
## 91 0.028548620914 -0.01402493385 -0.020731860338
## 92 -0.025344201753 -0.02941972040 0.061441363755
## 93 0.033043627087 -0.06425625221 0.006977690341
## 94 0.004645520153 -0.00396566076 -0.002210531525
## 95 0.022859051378 -0.03529573977 0.024412135463
## 96 -0.126807146682 0.08516506459 0.024834712465
## 97 0.072964732615 -0.08584732989 0.021660648476
## 98 -0.008365148117 0.04800353601 -0.045745327588
## 99 -0.002641721840 -0.00049715746 0.002338582974
## 100 0.065267948959 -0.03198542749 -0.037426350785
## 101 0.062500469216 -0.15395946832 0.070722399597
## 102 0.156829748031 0.04091117047 -0.160755980330
## 103 0.299437549864 -0.31838233928 0.035749448605
## 104 -0.124347321368 0.24509913346 -0.140164427681
## 105 0.008546909865 -0.01254293919 0.002015138312
## 106 -0.031286704435 -0.02389841704 0.058269988093
## 107 0.119737722210 -0.17946434469 0.069241631130
## 108 -0.045764635344 -0.06809643151 0.120173727300
## 109 0.085473352306 0.03567188680 -0.097137627322
## 110 0.140126231054 0.14285594740 -0.244102278544
## 111 -0.002481724099 0.00438996420 -0.002908468991
## 112 0.038870322361 -0.02728932847 -0.017742872075
## 113 -0.008267239360 0.04963114898 -0.031952058775
## 114 -0.092164400137 -0.09356999324 0.182794231945
## 115 0.009232892368 -0.00224637858 -0.009138650283
## 116 0.017865742417 -0.00682054333 -0.014938551521
## 117 0.028423405268 0.00790349153 -0.030948229836
## 118 0.000333272276 -0.01116828060 0.016755186812
## 119 -0.000858025557 0.00075517280 -0.000114813571
## 120 0.022286978627 -0.08299919049 0.043012406356
## 121 0.150670594068 -0.00812000461 -0.119172335486
## 122 -0.003351579787 0.00252907697 0.002561700538
## 123 -0.002397878718 -0.00673460640 0.006496253519
## 124 -0.301826331423 0.19187575909 0.066121280364
## 125 -0.125714771525 -0.02757701164 0.155829073567
## 126 -0.064625240520 0.21423633983 -0.113175527923
## 127 0.074617700945 -0.11565135494 0.051089741523
## 128 0.023004112699 0.05978531421 -0.102885662706
## 129 0.133272019176 -0.19314668207 0.063730240773
## 130 0.001101945803 -0.00068974225 -0.000516371963
## 131 -0.002092429732 0.00325896106 0.001633282589
## 132 0.100307040926 -0.21677065818 0.119341400793
## 133 -0.000144697790 -0.00036069802 0.000551972656
## 134 -0.059851442880 0.12877896663 -0.089913071028
## 135 -0.042511167076 0.06035385879 -0.003159759231
## 136 -0.065886531728 0.05583702827 0.024217789120
## 137 0.033199570168 -0.01144174145 -0.013793346706
## 138 -0.006599957966 -0.00350138251 0.013935857127
## 139 -0.019951215587 -0.00310015122 0.015464292577
## 140 -0.154450257911 -0.03124613689 0.143757440940
## 141 -0.002728444107 -0.00491697305 0.006265697508
## 142 0.006469801954 0.01129090102 -0.018218931225
## 143 -0.000008267508 0.00001663232 -0.000001248269
## 144 0.001039279806 0.01221813525 -0.009016784245
## 145 -0.002897493214 0.02387998606 -0.022056498043
## 146 -0.048425756230 0.02907068944 0.011409314113
## 147 0.008897551283 -0.09130856796 0.058823304380
## 148 -0.002472852271 -0.00686690246 0.022192728650
## 149 0.014343387070 -0.00369890653 -0.012465194125
## 150 -0.001697638784 0.00136382645 -0.000383557430
## 151 0.154235129668 -0.00228857465 -0.121798014063
## 152 -0.081052237193 -0.00609940537 0.054771347537
## 153 0.084913770453 0.00633215454 -0.076297657888
Debemos ver qué observación tiene más influencia en términos de desviaciones estándar.
Vamos a incluir la variable si el país es “desarrollado” o “en vías de desarrollo”.
Primero la vemos
table(base_modif$status)
##
## Developed Developing
## 28 125
Le asignaremos un 1 si es desarrollado y 0 si es en vías de desarrollo.
base_modif <- base_modif %>%
mutate(estado=case_when(status=="Developed"~1,
status=="Developing"~0))
La vemos de nuevo
table(base_modif$estado)
##
## 0 1
## 125 28
class(base_modif$estado)
## [1] "numeric"
Le asignamos etiquetas
base_modif <- base_modif %>%
mutate(estado=factor(estado, levels = c("0", "1"),
labels =c("En desarrollo", "Desarrollados")))
Ahora la incluimos en el modelo y quitamos el PIB porque seguramente están estrechamente relacionadas.
modelo6 <- lm(life_expectancy~schooling+estado, data=base_modif)
summary(modelo6)
##
## Call:
## lm(formula = life_expectancy ~ schooling + estado, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.4686 -3.2238 0.3234 3.4793 8.9721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 43.3782 1.9530 22.211 <0.0000000000000002 ***
## schooling 2.1483 0.1555 13.818 <0.0000000000000002 ***
## estadoDesarrollados 1.9292 1.1797 1.635 0.104
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.64 on 150 degrees of freedom
## Multiple R-squared: 0.6838, Adjusted R-squared: 0.6796
## F-statistic: 162.2 on 2 and 150 DF, p-value: < 0.00000000000000022
Graficamos
coef_graf<- dwplot(list(modelo6), show_intercept = FALSE,
vline = geom_vline(xintercept = 0, colour = "grey60", linetype = 2)) %>%
relabel_predictors(c(schooling = "Años promedio de escolaridad",
estadoDesarrollados= "Nivel de desarrollo")) +
theme_bw() +
theme(legend.position = "none")+
xlab("Coeficientes estimados") +
ylab("Esperanza de vida") +
ggtitle("Modelo de regresión lineal múltiple")
coef_graf
Vemos que el nivel de desarrollo no tiene un efecto significativo. Podemos volver a utilizar el PIB. O bien, utilizar el “Income composition of resources” (Human Development Index in terms of income composition of resources (index ranging from 0 to 1)) para ver qué pasa.
Vamos a convertir esa variable en una categórica (para practicar).
La vemos
summary(base_modif$income_composition_of_resources)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.3470 0.5650 0.7340 0.6969 0.8230 0.9480
Construimos 5 categorías.
base_modif <- base_modif %>%
mutate(recursos=case_when(income_composition_of_resources>=0 & income_composition_of_resources<=0.3470~1,
income_composition_of_resources>=0.3471 & income_composition_of_resources<=0.5650~2,
income_composition_of_resources>=0.5651 & income_composition_of_resources<=0.7340~3,
income_composition_of_resources>=0.7341 & income_composition_of_resources<=0.8230~4,
income_composition_of_resources>=0.8231 & income_composition_of_resources<=0.9480~5))
La vemos
table(base_modif$recursos)
##
## 1 2 3 4 5
## 1 38 39 38 37
Ponemos etiquetas
base_modif <- base_modif %>%
mutate(recursos=factor(recursos, levels = c("1", "2", "3", "4", "5"),
labels =c("Muy Bajo", "Bajo", "Medio", "Alto", "Muy alto")))
Estimamos el modelo
modelo7 <- lm(life_expectancy~schooling+recursos, data=base_modif)
summary(modelo7)
##
## Call:
## lm(formula = life_expectancy ~ schooling + recursos, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5353 -2.1449 0.1014 2.6013 8.0857
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.5397 4.0804 11.896 < 0.0000000000000002 ***
## schooling 0.5578 0.2186 2.552 0.011744 *
## recursosBajo 7.6966 3.8597 1.994 0.047988 *
## recursosMedio 14.7907 3.9871 3.710 0.000294 ***
## recursosAlto 18.8403 4.1272 4.565 0.000010476 ***
## recursosMuy alto 23.2019 4.3406 5.345 0.000000337 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.774 on 147 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.788
## F-statistic: 114 on 5 and 147 DF, p-value: < 0.00000000000000022
lmbeta <- lm.beta(modelo7)
summary(lmbeta)
##
## Call:
## lm(formula = life_expectancy ~ schooling + recursos, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.5353 -2.1449 0.1014 2.6013 8.0857
##
## Coefficients:
## Estimate Standardized Std. Error t value Pr(>|t|)
## (Intercept) 48.5397 NA 4.0804 11.896 < 0.0000000000000002
## schooling 0.5578 0.2003 0.2186 2.552 0.011744
## recursosBajo 7.6966 0.4071 3.8597 1.994 0.047988
## recursosMedio 14.7907 0.7890 3.9871 3.710 0.000294
## recursosAlto 18.8403 0.9964 4.1272 4.565 0.000010476
## recursosMuy alto 23.2019 1.2161 4.3406 5.345 0.000000337
##
## (Intercept) ***
## schooling *
## recursosBajo *
## recursosMedio ***
## recursosAlto ***
## recursosMuy alto ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.774 on 147 degrees of freedom
## Multiple R-squared: 0.795, Adjusted R-squared: 0.788
## F-statistic: 114 on 5 and 147 DF, p-value: < 0.00000000000000022
confint(lmbeta)
## 2.5 % 97.5 %
## (Intercept) NA NA
## schooling -0.231683 0.6323254
## recursosBajo -7.220521 8.0346458
## recursosMedio -7.090391 8.6684482
## recursosAlto -7.159802 9.1526656
## recursosMuy alto -7.361937 9.7941562
Hacemos una dummy de recursos=1 si es arriba de la media y 0 si es menor.
base_modif <- base_modif %>%
mutate(recursos_2 = ifelse(income_composition_of_resources <= 0.6969,1,0))
Estimamos el modelo
modelo8 <- lm(life_expectancy~schooling*recursos_2, data=base_modif)
summary(modelo8)
##
## Call:
## lm(formula = life_expectancy ~ schooling * recursos_2, data = base_modif)
##
## Residuals:
## Min 1Q Median 3Q Max
## -13.4507 -2.4925 0.4447 2.9703 9.0526
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 54.97047 3.95992 13.882 < 0.0000000000000002 ***
## schooling 1.48171 0.26159 5.664 0.0000000738 ***
## recursos_2 -6.40924 5.01719 -1.277 0.203
## schooling:recursos_2 0.03491 0.39334 0.089 0.929
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.328 on 149 degrees of freedom
## Multiple R-squared: 0.7267, Adjusted R-squared: 0.7212
## F-statistic: 132 on 3 and 149 DF, p-value: < 0.00000000000000022
Graficamos
plot_model(modelo8, type = "pred", terms = c("schooling", "recursos_2"))