Análisis de regresión simple

https://bookdown.org/rwnahhas/RMPH/slr-model-with-a-continuous-predictor.html

El modelo de regresión simple tiene como objetivo analizar la asociación entre una variable de respuesta o dependiente (cuantitativa, por ahora) y una variable explicativa o independiente.

Por ejemplo podríamos preguntarnos si el pib de los países tiene un efecto en la violencia.

Vamos a denotar a “y” como la variable respuesta y a “x” como la variable explicativa. Queremos saber cómo los valores de “y” (en promedio) tienden a cambiar dados los cambios en la variable “x” (en promedio).

La fórmula y= alfa+betax, nos dice que “y” es una función lineal de “x” y que vamos a tener un intercepto y una pendiente. En el contexto de la regresión, estos serán nuestros coeficientes.

Es importante que antes de efectuar el análisis de regresión, veamos un diagrama de dispersión para evaluar si existe relación lineal entre las variables.

Una vez que analizamos la posibilidad de estimar el modelo, vamos a tener nuestra ecuación estimada. La ecuación predicha o estimada, es la mejor línea o el mejor ajuste que minimza el error. La difrencia entre el valor observado y el predicho, es el residuo.

b = sumatoria(x-xbarra)(y-ybarra)/sumatoria(x-xbarra)^2 a = ybarra-bxbarra

La SSE (suma de los residuos cuadrados): sumatoria(yobservada-ypredicha)^2 La TSS(suma total): sumatoria(yobservada-ybarra)^2 La MSE (suma del modelo): TSE-SSE

Coeficiente de determinación: TSS-SSE/TSS -El coeficiente de determinación va entre 0 y 1

Cuando la regresión tiene p parámetros desconocidos, los grados de libertad serán df=n-p.

Paquetería

library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
## ✔ ggplot2 3.4.1      ✔ purrr   0.3.5 
## ✔ tibble  3.1.8      ✔ dplyr   1.0.10
## ✔ tidyr   1.2.1      ✔ stringr 1.4.1 
## ✔ readr   2.1.2      ✔ forcats 0.5.2 
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(janitor)
## 
## Attaching package: 'janitor'
## 
## The following objects are masked from 'package:stats':
## 
##     chisq.test, fisher.test
library(knitr)
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## 
## The following object is masked from 'package:dplyr':
## 
##     group_rows
library(broom)
library(jtools)
library(dotwhisker)
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(lm.beta)
library(sjPlot)
## Learn more about sjPlot with 'browseVignettes("sjPlot")'.
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## 
## The following object is masked from 'package:dplyr':
## 
##     recode
## 
## The following object is masked from 'package:purrr':
## 
##     some

Lectura de datos

base <- read.csv("C:/Users/molar/Dropbox/2023_Trabajos/FLACSO/Datos/life_expec_who.csv")

Vamos a limpiar los nombres

base<- clean_names(base)

Quitar notación científica

options(scipen = 999)

Filtrar la base

base_modif <- base %>% 
  filter(year==2015) %>% 
  filter(!is.na(life_expectancy)) %>% 
  filter(!is.na(schooling)) %>% 
  filter(!is.na(gdp)) 

El material de la base de datos, lo pueden consultar aquí: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download

Vamos a hacer un análisis de regresión para explicar la esperanza de vida a partir de la escolaridad.

Primero podemos ver cómo se distribuyen ambas variables.

base %>% 
  ggplot(aes(x=schooling))+
  geom_histogram()+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 163 rows containing non-finite values (`stat_bin()`).

base %>% 
  ggplot(aes(x=life_expectancy))+
  geom_histogram()+
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 10 rows containing non-finite values (`stat_bin()`).

Luego hacemos un diagrama de dispersión básico

base %>% 
  ggplot(aes(x=schooling, y=life_expectancy))+
  geom_point()+
  geom_smooth(method="lm")+
  theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
## Warning: Removed 170 rows containing non-finite values (`stat_smooth()`).
## Warning: Removed 170 rows containing missing values (`geom_point()`).

Vemos que a lo mejor sí conviene estimar el modelo de regresión.

Podemos calcular la ecuación estimada a mano.

base$var1 <- base$schooling-mean(base$schooling, na.rm=T)
base$var2 <- base$life_expectancy-mean(base$life_expectancy, na.rm=T)
base$var3 <- base$var1*base$var2

Entonces el numerador sería

base %>% 
  summarise(suma=sum(var3, na.rm=T))
##       suma
## 1 65181.43

Y el denominador

base$var4 <- base$var1*base$var1

base %>% 
  summarise(suma=sum(var4, na.rm=T))
##       suma
## 1 31297.22

Entonces b es

b <- 65181.43/31297.22

Intercepto

a <- mean(base$life_expectancy, na.rm=T)-(2.08*mean(base$schooling, na.rm=T))

Forma rápida

modelo <- lm(life_expectancy~schooling, data=base_modif)
summary(modelo)
## 
## Call:
## lm(formula = life_expectancy ~ schooling, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5807  -2.8686   0.3736   3.3040  10.3697 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  41.8409     1.7214   24.31 <0.0000000000000002 ***
## schooling     2.2930     0.1286   17.84 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.665 on 151 degrees of freedom
## Multiple R-squared:  0.6781, Adjusted R-squared:  0.676 
## F-statistic: 318.1 on 1 and 151 DF,  p-value: < 0.00000000000000022

El intervalo de confianza

confint(modelo)
##                2.5 %    97.5 %
## (Intercept) 38.43974 45.242107
## schooling    2.03897  2.546961

Tabla anova

anova <- aov(life_expectancy ~ schooling, data = base)

summary(anova)
##               Df Sum Sq Mean Sq F value              Pr(>F)    
## schooling      1 137101  137101    3599 <0.0000000000000002 ***
## Residuals   2766 105355      38                                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 170 observations deleted due to missingness

La columna DF muestra los grados de libertad para la variable independiente y los grados de libertad para los residuales.

Sum Sq muestra la variación total con respecto a la media global.

Mean Sq es la división de Sum Sq entre los grados de libertad.

F es el estadístico de preuba del modelo.

Pr(>F) es el pvalor de la prueba F.

Para hacer una tabla.

Guardamos la información del modelo.

modelo.info <- summary(modelo)

Extraemos los coeficientes

modelo.Coef <- as.data.frame(modelo.info$coefficients[, 1:2])

Calculamos los intervalos de confianza

modelo.Coef <- within(modelo.Coef, {
  Lower <- Estimate - qt(p = 0.95, df = modelo.info$df[2]) * `Std. Error`
  Upper <- Estimate + qt(p = 0.95, df = modelo.info$df[2]) * `Std. Error`
  Variable <- row.names(modelo.Coef)
  })

Vemos la tabla

modelo.Coef
##              Estimate Std. Error    Variable     Upper     Lower
## (Intercept) 41.840924  1.7214220 (Intercept) 44.689890 38.991958
## schooling    2.292966  0.1285533   schooling  2.505722  2.080209

Graficamos

ggplot(data = modelo.Coef, mapping = aes(x = Estimate, y = Variable)) +
  geom_point() +
  geom_errorbarh(aes(xmin = Lower, xmax = Upper), height = .4) +
  ggtitle("Coeficientes del modelo de regresión") +
  theme(plot.title = element_text(hjust = 0.5))

Otra forma: https://cran.r-project.org/web/packages/jtools/vignettes/summ.html#plot_summs()_and_plot_coefs()

https://www.danieldsjoberg.com/gtsummary/articles/tbl_regression.html

Datos: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who?resource=download Más datos: https://vincentarelbundock.github.io/Rdatasets/articles/data.html Dataworld

Presentación de resultados en tabla para markdown: https://zief0002.github.io/book-8252/pretty-printing-tables-in-markdown.html

lm(life_expectancy ~ schooling, data = base) %>%
  tidy() %>%
  kable(
    caption = "Modelo de regresión lineal simple",
    col.names = c("Variable", "Coeficiente", "SE", "t", "p"),
    digits = c(0, 2, 3, 2, 3)
  )
Modelo de regresión lineal simple
Variable Coeficiente SE t p
(Intercept) 44.11 0.437 100.99 0
schooling 2.10 0.035 60.00 0

Análisis de regresión lineal múltiple

Ahora vamos a explicar la esperanza de vida en términos de la escolaridad y el gdp

base_modif <- base %>% 
  filter(year==2015) %>% 
  filter(!is.na(life_expectancy)) %>% 
  filter(!is.na(schooling)) %>% 
  filter(!is.na(gdp)) 

Lo vemos

modelo2 <- lm(life_expectancy~schooling+gdp, data=base_modif)
summary(modelo2)
## 
## Call:
## lm(formula = life_expectancy ~ schooling + gdp, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.5386  -2.8642   0.6052   3.0684  10.8943 
## 
## Coefficients:
##                Estimate  Std. Error t value            Pr(>|t|)    
## (Intercept) 43.03414650  1.79643351  23.955 <0.0000000000000002 ***
## schooling    2.16048601  0.14216791  15.197 <0.0000000000000002 ***
## gdp          0.00007437  0.00003570   2.083              0.0389 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.615 on 150 degrees of freedom
## Multiple R-squared:  0.6872, Adjusted R-squared:  0.683 
## F-statistic: 164.8 on 2 and 150 DF,  p-value: < 0.00000000000000022

Intervalos de confianza

confint(modelo2, level=0.95) # intervalos de confianza
##                       2.5 %        97.5 %
## (Intercept) 39.484563948421 46.5837290523
## schooling    1.879575683472  2.4413963440
## gdp          0.000003837348  0.0001449069

Anova

anova(modelo2) # anova
## Analysis of Variance Table
## 
## Response: life_expectancy
##            Df Sum Sq Mean Sq  F value               Pr(>F)    
## schooling   1 6924.5  6924.5 325.1862 < 0.0000000000000002 ***
## gdp         1   92.4    92.4   4.3406              0.03891 *  
## Residuals 150 3194.1    21.3                                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Graficar coeficientes Con el paquete dotwhisker https://cran.r-project.org/web/packages/dotwhisker/vignettes/dotwhisker-vignette.html

coef_graf<- dwplot(list(modelo2), show_intercept = TRUE, 
                   vline = geom_vline(xintercept = 0, colour = "grey60", linetype = 2)) %>% 
  relabel_predictors(c(schooling = "Años promedio de escolaridad", gdp = "PIB ")) +
  theme_bw() + 
  xlab("Coeficientes estimados") + 
  ylab("Esperanza de vida") +
  geom_vline(xintercept = 0, colour = "grey60", linetype = 2) +
  ggtitle("Modelo de regresión lineal múltiple") 
coef_graf

Con el paquete jtools https://cran.r-project.org/web/packages/jtools/vignettes/summ.html#plot_summs_and_plot_coefs

plot_coefs(modelo2, scale = TRUE)
## Loading required namespace: broom.mixed

Verificación de supuestos

  1. Linealidad: existe una relación lineal entre x y y

Todo cambio en x, representa un cambio en y (manteniendo constante lo demás) La linealidad la podemos ver comparando los residuales con valores de y

modelo.metrics <- augment(modelo2)
modelo.metrics %>%
  ggplot(aes(.fitted, .resid)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula = 'y ~ x'

Lo ideal es que la línea recta esté aproximada a cero

(En el resultado de la regresión también podemos hablar de la linealidad)

  1. Normalidad de los errores

Los residuos siguen una distribución normal.

Graficamos

modelo.metrics %>%
  ggplot(aes(.resid)) +
  geom_density() 

Hacemos la prueba de hipótesis

shapiro.test(modelo.metrics$.resid)
## 
##  Shapiro-Wilk normality test
## 
## data:  modelo.metrics$.resid
## W = 0.97727, p-value = 0.01235
  1. Homocedasticidad

Varianza constante.

modelo.metrics %>%
  ggplot(aes(.fitted, .std.resid)) +
  geom_point() 

bptest(modelo2)
## 
##  studentized Breusch-Pagan test
## 
## data:  modelo2
## BP = 5.6967, df = 2, p-value = 0.05794
  1. Independencia

El valor de u (factores distintos a x que afectan y), no depende del valor de x

modelo.metrics %>%
  ggplot(aes(schooling, .resid)) +
  geom_point() 

modelo.metrics %>%
  ggplot(aes(gdp, .resid)) +
  geom_point() 

Correlación

cor.test(modelo.metrics$.resid, modelo.metrics$schooling, method= "pearson", use= "pairwise.complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  modelo.metrics$.resid and modelo.metrics$schooling
## t = -0.0000000000000069822, df = 151, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1586781  0.1586781
## sample estimates:
##                       cor 
## -0.0000000000000005682042
cor.test(modelo.metrics$.resid, modelo.metrics$gdp, method= "pearson", use= "pairwise.complete.obs")
## 
##  Pearson's product-moment correlation
## 
## data:  modelo.metrics$.resid and modelo.metrics$gdp
## t = -0.0000000000000047928, df = 151, p-value = 1
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.1586781  0.1586781
## sample estimates:
##                       cor 
## -0.0000000000000003900353

El valor esperado de los errores tiende a 0.

mean(modelo.metrics$.resid)
## [1] 0.00000000000002800383

Prueba Durbin Watson

dwtest(modelo2) 
## 
##  Durbin-Watson test
## 
## data:  modelo2
## DW = 2.1944, p-value = 0.8879
## alternative hypothesis: true autocorrelation is greater than 0
  1. Multicolinealidad
vif(modelo2)
## schooling       gdp 
##  1.250086  1.250086

Dado que ninguno de los valores es grande (>10) no hay multicolinealidad.

Transformaciones logarítmicas

Podemos transformar nuestra variable de gdp a logaritmo.

base_modif$ln_gdp <- log(base_modif$gdp)

Veamos la variable

base_modif %>% 
  ggplot(aes(ln_gdp))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Estimamos un nuevo modelo

modelo3 <- lm(life_expectancy~schooling+ln_gdp, data=base_modif)
summary(modelo3)
## 
## Call:
## lm(formula = life_expectancy ~ schooling + ln_gdp, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.9738  -2.7413   0.5747   3.3512  10.5633 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)  40.0891     2.0386  19.665 <0.0000000000000002 ***
## schooling     2.1588     0.1534  14.075 <0.0000000000000002 ***
## ln_gdp        0.4473     0.2822   1.585               0.115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.642 on 150 degrees of freedom
## Multiple R-squared:  0.6834, Adjusted R-squared:  0.6792 
## F-statistic: 161.9 on 2 and 150 DF,  p-value: < 0.00000000000000022

La interpretación es distinta.

(Ver página 46 de Wooldridge)

base_modif$ln_life_expectancy <- log(base_modif$life_expectancy)
modelo4 <- lm(ln_life_expectancy~schooling+gdp, data=base_modif)
summary(modelo4)
## 
## Call:
## lm(formula = ln_life_expectancy ~ schooling + gdp, data = base_modif)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.252909 -0.039535  0.008504  0.047082  0.141750 
## 
## Coefficients:
##                 Estimate   Std. Error t value            Pr(>|t|)    
## (Intercept) 3.8532005334 0.0269468340 142.993 <0.0000000000000002 ***
## schooling   0.0311503842 0.0021325449  14.607 <0.0000000000000002 ***
## gdp         0.0000009473 0.0000005355   1.769              0.0789 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06922 on 150 degrees of freedom
## Multiple R-squared:  0.6663, Adjusted R-squared:  0.6619 
## F-statistic: 149.8 on 2 and 150 DF,  p-value: < 0.00000000000000022
modelo5 <- lm(ln_life_expectancy~schooling+ln_gdp, data=base_modif)
summary(modelo5)
## 
## Call:
## lm(formula = ln_life_expectancy ~ schooling + ln_gdp, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.25900 -0.04063  0.00836  0.04917  0.13780 
## 
## Coefficients:
##             Estimate Std. Error t value            Pr(>|t|)    
## (Intercept) 3.813260   0.030486 125.082 <0.0000000000000002 ***
## schooling   0.030943   0.002294  13.490 <0.0000000000000002 ***
## ln_gdp      0.006317   0.004220   1.497               0.137    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06942 on 150 degrees of freedom
## Multiple R-squared:  0.6644, Adjusted R-squared:  0.6599 
## F-statistic: 148.5 on 2 and 150 DF,  p-value: < 0.00000000000000022

Datos atípicos

Seleccionamos el modelo 3 para hacer pruebas de datos atípicos

https://fhernanb.github.io/libro_regresion/diag2.html#distancia-de-cook

Es una medida de cómo influye la observación i-ésima sobre la estimación de
β al ser retirada del conjunto de datos.

Son puntos influyentes los que presenten D=4/n-p-2

cooks.distance(modelo3)
##                  1                  2                  3                  4 
## 0.0000152609620583 0.0013326131588175 0.0000609316199086 0.0464398360854721 
##                  5                  6                  7                  8 
## 0.0009762527936573 0.0097215380936186 0.0065259818829327 0.0335052876677183 
##                  9                 10                 11                 12 
## 0.0023847296763829 0.0115874043179398 0.0003295250254510 0.0283956174812847 
##                 13                 14                 15                 16 
## 0.0009079591788791 0.0051937729072484 0.0005541195928025 0.0000361252082947 
##                 17                 18                 19                 20 
## 0.0069729021194751 0.0000684607862395 0.0010140323528844 0.0053387289059752 
##                 21                 22                 23                 24 
## 0.0006913529835590 0.0004346606516754 0.0005934292142736 0.0000465422058382 
##                 25                 26                 27                 28 
## 0.0249665919632421 0.0000254393287296 0.0005926593807071 0.0134617144634662 
##                 29                 30                 31                 32 
## 0.0020665572627654 0.0170976281801327 0.0178299919467393 0.0078147736733491 
##                 33                 34                 35                 36 
## 0.0028005536882837 0.0014818508094967 0.0021421339891368 0.0063183730685041 
##                 37                 38                 39                 40 
## 0.0001000268799098 0.0086089180397428 0.0147206020195278 0.0002589993524672 
##                 41                 42                 43                 44 
## 0.0373586852844952 0.0002904666816929 0.0025507633505177 0.0001624287729098 
##                 45                 46                 47                 48 
## 0.0075188324429093 0.0007746180661369 0.0050712748655420 0.0080409604400490 
##                 49                 50                 51                 52 
## 0.0001059039160516 0.0024053443368847 0.0053475659826412 0.0000432169860648 
##                 53                 54                 55                 56 
## 0.0002641529937740 0.0041810822623516 0.0027597779327475 0.0045375180064077 
##                 57                 58                 59                 60 
## 0.0062769063585186 0.0028121144659491 0.0044574738547696 0.0000068369978452 
##                 61                 62                 63                 64 
## 0.0001750083802268 0.0074494528235076 0.0010550924312916 0.0031722374545900 
##                 65                 66                 67                 68 
## 0.0000023349573595 0.0006632079410057 0.0036903699561217 0.0040257124072187 
##                 69                 70                 71                 72 
## 0.0040516483180361 0.0183151825915511 0.0029864347492673 0.0137636921313313 
##                 73                 74                 75                 76 
## 0.0022937397940128 0.0065288150984576 0.0022437200995894 0.0009057531384611 
##                 77                 78                 79                 80 
## 0.0007584569458715 0.0045458373573546 0.0017448299838654 0.0568322526766040 
##                 81                 82                 83                 84 
## 0.0020429052990234 0.0113056205952214 0.0129325077505058 0.0020227755784420 
##                 85                 86                 87                 88 
## 0.0156345133556952 0.0013242416651289 0.0099592568698121 0.0033145935280630 
##                 89                 90                 91                 92 
## 0.0100812116552730 0.0008696869989883 0.0010463434475320 0.0027815394107485 
##                 93                 94                 95                 96 
## 0.0067109982318748 0.0000415882490700 0.0025362394495025 0.0074477998091452 
##                 97                 98                 99                100 
## 0.0042161269404682 0.0020989489485771 0.0000056296892965 0.0022579357229904 
##                101                102                103                104 
## 0.0094693970600864 0.0146409390793957 0.0488757965499172 0.0314332925570057 
##                105                106                107                108 
## 0.0000896679114399 0.0018713417496201 0.0144529689960953 0.0078584888366887 
##                109                110                111                112 
## 0.0055631654272090 0.0223341792830110 0.0000238600261144 0.0012806360118160 
##                113                114                115                116 
## 0.0011758431414322 0.0131395991882986 0.0001362200667794 0.0004358633943257 
##                117                118                119                120 
## 0.0004300606945287 0.0006561329311216 0.0000008066387638 0.0038809701243981 
##                121                122                123                124 
## 0.0091349610892256 0.0000459434117352 0.0000574346229891 0.0415389920823922 
##                125                126                127                128 
## 0.0123002796727313 0.0204249584297862 0.0071551348337803 0.0120429491149783 
##                129                130                131                132 
## 0.0148930632419581 0.0000007315601437 0.0000915140431844 0.0183701910003005 
##                133                134                135                136 
## 0.0000002016906090 0.0147034257643188 0.0034180985280767 0.0047835562398229 
##                137                138                139                140 
## 0.0007217589040399 0.0003283389421366 0.0004775688936504 0.0160996404471349 
##                141                142                143                144 
## 0.0000200043782110 0.0001597222340837 0.0000000005714558 0.0001583872100959 
##                145                146                147                148 
## 0.0003402350301434 0.0011980704735186 0.0056631870257149 0.0023469713045343 
##                149                150                151                152 
## 0.0001844799911948 0.0000066061084014 0.0110878958659723 0.0086517204174718 
##                153 
## 0.0032591139892038

Graficamos para ver quiénes son

library(car)
influenceIndexPlot(modelo3, vars="Cook")

Otra forma

cutoff <- 4 / (153-2-2)  # D=4/n-p-2
plot(modelo3, which=4, cook.levels=cutoff, las=1)
abline(h=cutoff, lty="dashed", col="dodgerblue2")

DFBETAS

Es una medida que indica cuánto cambia el coeficiente de regresión, en unidades de desviación estándar, si se omitiera la i -ésima observación.

dffits(modelo3)
##              1              2              3              4              5 
##  0.00674378340  0.06312795381  0.01347605107 -0.38781614544  0.05397283897 
##              6              7              8              9             10 
## -0.17099258535  0.13993171031 -0.31799643937  0.08437147412  0.18617712840 
##             11             12             13             14             15 
##  0.03134180031  0.29357176317 -0.05204717374 -0.12497013952  0.04064892154 
##             16             17             18             19             20 
## -0.01037603616 -0.14501638542 -0.01428471835  0.05504256543 -0.12673783020 
##             21             22             23             24             25 
## -0.04541717315  0.03600711282 -0.04207709499  0.01177710358 -0.27386718830 
##             26             27             28             29             30 
##  0.00870717716  0.04204958603 -0.20255480452  0.07853269572 -0.22683817650 
##             31             32             33             34             35 
## -0.23172487590  0.15332556355  0.09156615611  0.06651550948 -0.08005065197 
##             36             37             38             39             40 
##  0.13792267676  0.01726630714  0.16109438370  0.21318910838  0.02778336097 
##             41             42             43             44             45 
##  0.33594354368  0.02942980358  0.08730973312  0.02200572141 -0.15027386273 
##             46             47             48             49             50 
## -0.04806153741  0.12319739083 -0.15600359209  0.01776588241  0.08474260801 
##             51             52             53             54             55 
## -0.12679035227  0.01134910074 -0.02805921709 -0.11215861188  0.09071835982 
##             56             57             58             59             60 
## -0.11668472436  0.13730968158 -0.09166577657 -0.11553153795  0.00451380095 
##             61             62             63             64             65 
##  0.02283904349  0.15010394622 -0.05611483842 -0.09730899127 -0.00263784237 
##             66             67             68             69             70 
## -0.04449844761  0.10503830347 -0.10966465533  0.11005809211  0.23450145570 
##             71             72             73             74             75 
##  0.09466047904  0.20365363262  0.08278979981 -0.13965539309 -0.08196220627 
##             76             77             78             79             80 
## -0.05201361777  0.04755442747 -0.11672775448  0.07221574062 -0.42093108162 
##             81             82             83             84             85 
## -0.07812083555 -0.18474068698  0.19869653851  0.07766806794 -0.21791187463 
##             86             87             88             89             90 
##  0.06288208237  0.17359486110 -0.09952483661  0.17418593936  0.05092671514 
##             91             92             93             94             95 
## -0.05588987303  0.09125251892 -0.14249305511 -0.01113297400  0.08721641085 
##             96             97             98             99            100 
## -0.14955423819  0.11233185284 -0.07922558834 -0.00409593108 -0.08208344645 
##            101            102            103            104            105 
## -0.16834286355  0.21046762113  0.38511671945 -0.31172978923 -0.01634718801 
##            106            107            108            109            110 
##  0.07475550616  0.20856718109  0.15373859671  0.12919764578  0.25889764891 
##            111            112            113            114            115 
## -0.00843247316 -0.06182390067  0.05922482491  0.19845040703 -0.02015036687 
##            116            117            118            119            120 
## -0.03605360515  0.03580275139  0.04425616953 -0.00155041532 -0.10779817873 
##            121            122            123            124            125 
##  0.16542597334  0.01170155658 -0.01308339176 -0.36073848916  0.19219102109 
##            126            127            128            129            130 
##  0.24866632683  0.14660093817 -0.19195412499  0.21113784167 -0.00147649938 
##            131            132            133            134            135 
##  0.01651605343  0.23491167376  0.00077526653 -0.21227631554  0.10119574033 
##            136            137            138            139            140 
##  0.11980970376  0.04641114822  0.03129228293 -0.03774668659 -0.22181790497 
##            141            142            143            144            145 
## -0.00772101559 -0.02181834309  0.00004126667  0.02172890392 -0.03184709905 
##            146            147            148            149            150 
## -0.05980137500 -0.13047266439  0.08390691322 -0.02344952512 -0.00443695355 
##            151            152            153 
##  0.18283812739 -0.16240143500  0.09865180792
dfbetas(modelo3)
##         (Intercept)      schooling          ln_gdp
## 1    0.005660185140 -0.00274042992 -0.001964521560
## 2   -0.010361732846  0.01633517382  0.004713397560
## 3   -0.002765782233  0.00418252664  0.000837144858
## 4   -0.085103431156  0.24506607486 -0.193265030753
## 5   -0.022809947606 -0.01284730925  0.038796863681
## 6    0.111634350451 -0.09948501286 -0.029043805441
## 7    0.084639600543  0.05352970718 -0.112535396341
## 8    0.255922512437 -0.19555757852 -0.078139230643
## 9   -0.058634130255 -0.00097436326  0.061931834650
## 10   0.112906670168  0.09012257692 -0.175119602344
## 11  -0.016941490560 -0.00586716282  0.024007620209
## 12   0.255333992291  0.01223294440 -0.223346830269
## 13   0.030998014268 -0.00544187662 -0.029440465122
## 14   0.053164256377 -0.06429684990 -0.006768739355
## 15  -0.014946612493  0.03071479694 -0.009779582873
## 16   0.000360387630  0.00343284009 -0.004981793806
## 17  -0.112613855746  0.05239381370  0.037839909024
## 18  -0.003599410253  0.00339544599 -0.002113256173
## 19  -0.011926571206  0.01100515978  0.009600910446
## 20   0.012631725216  0.05684762273 -0.079684395513
## 21   0.022686408006 -0.01231145555 -0.015610529195
## 22  -0.008771012153  0.01714970538 -0.002285155519
## 23   0.018112691150 -0.01248554173 -0.011132359675
## 24   0.009036507682 -0.00905489848  0.000831930266
## 25  -0.213515217850 -0.07181156704  0.246384956629
## 26   0.000451307636  0.00097417310  0.000162637677
## 27   0.028369047265 -0.01898303230 -0.003224749471
## 28  -0.133567593869  0.11905604348 -0.010172803544
## 29  -0.056848177058  0.00549026578  0.053941197131
## 30  -0.192788383110  0.16140191952  0.014503477677
## 31  -0.162286380981  0.19283481989 -0.042678748178
## 32  -0.098002046581  0.06027648241  0.052162133522
## 33   0.037272643594  0.04197469001 -0.060926446243
## 34   0.028513772647  0.03407714367 -0.049809554040
## 35  -0.061588138511  0.01762026419  0.030365919652
## 36   0.033091192493  0.07980893979 -0.085733347944
## 37  -0.009638195110  0.00345073067  0.007835596995
## 38   0.057955594028  0.08969881920 -0.116570310627
## 39   0.002946598704  0.09986450545 -0.061278178027
## 40  -0.012226878355  0.02483370516 -0.009317290388
## 41   0.162986187079 -0.31340725274  0.153935895569
## 42  -0.005360922488 -0.00815525831  0.016468335255
## 43   0.032288644857  0.05117082204 -0.067109990820
## 44  -0.000766405342 -0.00308752161  0.007039491261
## 45  -0.092585782644  0.11646341023 -0.035182741255
## 46   0.005463439356 -0.04010155231  0.026916764420
## 47   0.096660256054 -0.08788228651  0.001663079369
## 48   0.054429989486 -0.07904831823  0.000396581855
## 49  -0.006569587956  0.01418019917 -0.005066950567
## 50  -0.061024895677  0.00923869812  0.055216497675
## 51   0.016753926465  0.05911783043 -0.084819260959
## 52  -0.001186441845  0.00191028331  0.001204742684
## 53   0.021474620583 -0.00664373061 -0.016072382937
## 54  -0.067908201999  0.04119115726  0.008771075829
## 55   0.016029755495  0.07463896395 -0.077845707811
## 56   0.063836197659 -0.04801193613 -0.029359957357
## 57   0.033410597220 -0.10192270110  0.076622702875
## 58  -0.076132685419  0.05755377187  0.009256433786
## 59  -0.095607623248  0.06793300935  0.014818970392
## 60   0.001112386644 -0.00355178790  0.002645084025
## 61   0.017234315033 -0.01562455709  0.000576379278
## 62   0.061994074738 -0.08824330206  0.043399366681
## 63   0.033094421993 -0.01449577035 -0.023893811372
## 64   0.044447973137 -0.08566009352  0.029683795943
## 65  -0.001410315782  0.00097309819  0.000017786799
## 66  -0.004189452543  0.00787756366 -0.010600091317
## 67   0.021289543242 -0.08492955027  0.066998432243
## 68   0.053932984324 -0.09282184280  0.025930109927
## 69  -0.076928203520  0.00569594557  0.075381069059
## 70   0.042413092427  0.18816520555 -0.194731253903
## 71  -0.004054180346 -0.03188817458  0.046567188773
## 72  -0.130232695729 -0.01804946666  0.154356870522
## 73   0.043948613203  0.03597546788 -0.064261667023
## 74  -0.050464994189 -0.09715751257  0.126058220481
## 75  -0.050977927001  0.03695148081  0.001955138176
## 76  -0.028712171480  0.01137208943  0.008058670268
## 77  -0.019709074240 -0.02131245859  0.041433212191
## 78   0.073186850829 -0.03786431347 -0.046028125352
## 79   0.032179612063  0.02982407592 -0.047491410137
## 80  -0.353886590719 -0.03099731105  0.317886122630
## 81  -0.068331695420  0.02863104236  0.028663695119
## 82   0.120443387378 -0.07791910675 -0.059694353028
## 83   0.055353297784  0.10087459385 -0.116131148905
## 84   0.063138078626  0.01575495522 -0.067808130657
## 85  -0.183651177397  0.01654409803  0.130914257492
## 86  -0.016334668896 -0.02408264555  0.044435229192
## 87  -0.030529665533 -0.07974750625  0.121328185437
## 88  -0.075295721835  0.07355563189 -0.006220012040
## 89  -0.096579059633 -0.03016671977  0.133247625867
## 90   0.032663545940 -0.04128668961  0.011848636155
## 91   0.028548620914 -0.01402493385 -0.020731860338
## 92  -0.025344201753 -0.02941972040  0.061441363755
## 93   0.033043627087 -0.06425625221  0.006977690341
## 94   0.004645520153 -0.00396566076 -0.002210531525
## 95   0.022859051378 -0.03529573977  0.024412135463
## 96  -0.126807146682  0.08516506459  0.024834712465
## 97   0.072964732615 -0.08584732989  0.021660648476
## 98  -0.008365148117  0.04800353601 -0.045745327588
## 99  -0.002641721840 -0.00049715746  0.002338582974
## 100  0.065267948959 -0.03198542749 -0.037426350785
## 101  0.062500469216 -0.15395946832  0.070722399597
## 102  0.156829748031  0.04091117047 -0.160755980330
## 103  0.299437549864 -0.31838233928  0.035749448605
## 104 -0.124347321368  0.24509913346 -0.140164427681
## 105  0.008546909865 -0.01254293919  0.002015138312
## 106 -0.031286704435 -0.02389841704  0.058269988093
## 107  0.119737722210 -0.17946434469  0.069241631130
## 108 -0.045764635344 -0.06809643151  0.120173727300
## 109  0.085473352306  0.03567188680 -0.097137627322
## 110  0.140126231054  0.14285594740 -0.244102278544
## 111 -0.002481724099  0.00438996420 -0.002908468991
## 112  0.038870322361 -0.02728932847 -0.017742872075
## 113 -0.008267239360  0.04963114898 -0.031952058775
## 114 -0.092164400137 -0.09356999324  0.182794231945
## 115  0.009232892368 -0.00224637858 -0.009138650283
## 116  0.017865742417 -0.00682054333 -0.014938551521
## 117  0.028423405268  0.00790349153 -0.030948229836
## 118  0.000333272276 -0.01116828060  0.016755186812
## 119 -0.000858025557  0.00075517280 -0.000114813571
## 120  0.022286978627 -0.08299919049  0.043012406356
## 121  0.150670594068 -0.00812000461 -0.119172335486
## 122 -0.003351579787  0.00252907697  0.002561700538
## 123 -0.002397878718 -0.00673460640  0.006496253519
## 124 -0.301826331423  0.19187575909  0.066121280364
## 125 -0.125714771525 -0.02757701164  0.155829073567
## 126 -0.064625240520  0.21423633983 -0.113175527923
## 127  0.074617700945 -0.11565135494  0.051089741523
## 128  0.023004112699  0.05978531421 -0.102885662706
## 129  0.133272019176 -0.19314668207  0.063730240773
## 130  0.001101945803 -0.00068974225 -0.000516371963
## 131 -0.002092429732  0.00325896106  0.001633282589
## 132  0.100307040926 -0.21677065818  0.119341400793
## 133 -0.000144697790 -0.00036069802  0.000551972656
## 134 -0.059851442880  0.12877896663 -0.089913071028
## 135 -0.042511167076  0.06035385879 -0.003159759231
## 136 -0.065886531728  0.05583702827  0.024217789120
## 137  0.033199570168 -0.01144174145 -0.013793346706
## 138 -0.006599957966 -0.00350138251  0.013935857127
## 139 -0.019951215587 -0.00310015122  0.015464292577
## 140 -0.154450257911 -0.03124613689  0.143757440940
## 141 -0.002728444107 -0.00491697305  0.006265697508
## 142  0.006469801954  0.01129090102 -0.018218931225
## 143 -0.000008267508  0.00001663232 -0.000001248269
## 144  0.001039279806  0.01221813525 -0.009016784245
## 145 -0.002897493214  0.02387998606 -0.022056498043
## 146 -0.048425756230  0.02907068944  0.011409314113
## 147  0.008897551283 -0.09130856796  0.058823304380
## 148 -0.002472852271 -0.00686690246  0.022192728650
## 149  0.014343387070 -0.00369890653 -0.012465194125
## 150 -0.001697638784  0.00136382645 -0.000383557430
## 151  0.154235129668 -0.00228857465 -0.121798014063
## 152 -0.081052237193 -0.00609940537  0.054771347537
## 153  0.084913770453  0.00633215454 -0.076297657888

Debemos ver qué observación tiene más influencia en términos de desviaciones estándar.

Variables explicativas categóricas

Vamos a incluir la variable si el país es “desarrollado” o “en vías de desarrollo”.

Primero la vemos

table(base_modif$status)
## 
##  Developed Developing 
##         28        125

Le asignaremos un 1 si es desarrollado y 0 si es en vías de desarrollo.

base_modif <- base_modif %>% 
  mutate(estado=case_when(status=="Developed"~1,
                          status=="Developing"~0))

La vemos de nuevo

table(base_modif$estado)
## 
##   0   1 
## 125  28
class(base_modif$estado)
## [1] "numeric"

Le asignamos etiquetas

base_modif <- base_modif %>%
  mutate(estado=factor(estado, levels = c("0", "1"), 
                           labels =c("En desarrollo", "Desarrollados")))

Ahora la incluimos en el modelo y quitamos el PIB porque seguramente están estrechamente relacionadas.

modelo6 <- lm(life_expectancy~schooling+estado, data=base_modif)
summary(modelo6)
## 
## Call:
## lm(formula = life_expectancy ~ schooling + estado, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.4686  -3.2238   0.3234   3.4793   8.9721 
## 
## Coefficients:
##                     Estimate Std. Error t value            Pr(>|t|)    
## (Intercept)          43.3782     1.9530  22.211 <0.0000000000000002 ***
## schooling             2.1483     0.1555  13.818 <0.0000000000000002 ***
## estadoDesarrollados   1.9292     1.1797   1.635               0.104    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.64 on 150 degrees of freedom
## Multiple R-squared:  0.6838, Adjusted R-squared:  0.6796 
## F-statistic: 162.2 on 2 and 150 DF,  p-value: < 0.00000000000000022

Graficamos

coef_graf<- dwplot(list(modelo6), show_intercept = FALSE, 
                   vline = geom_vline(xintercept = 0, colour = "grey60", linetype = 2)) %>% 
  relabel_predictors(c(schooling = "Años promedio de escolaridad", 
                      estadoDesarrollados= "Nivel de desarrollo")) +
  theme_bw() + 
  theme(legend.position = "none")+
  xlab("Coeficientes estimados") + 
  ylab("Esperanza de vida") +
  ggtitle("Modelo de regresión lineal múltiple") 
coef_graf

Vemos que el nivel de desarrollo no tiene un efecto significativo. Podemos volver a utilizar el PIB. O bien, utilizar el “Income composition of resources” (Human Development Index in terms of income composition of resources (index ranging from 0 to 1)) para ver qué pasa.

Vamos a convertir esa variable en una categórica (para practicar).

La vemos

summary(base_modif$income_composition_of_resources)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3470  0.5650  0.7340  0.6969  0.8230  0.9480

Construimos 5 categorías.

base_modif <- base_modif %>% 
  mutate(recursos=case_when(income_composition_of_resources>=0 & income_composition_of_resources<=0.3470~1,
                          income_composition_of_resources>=0.3471 & income_composition_of_resources<=0.5650~2,
                          income_composition_of_resources>=0.5651 & income_composition_of_resources<=0.7340~3,
                          income_composition_of_resources>=0.7341 & income_composition_of_resources<=0.8230~4,
                          income_composition_of_resources>=0.8231 & income_composition_of_resources<=0.9480~5))

La vemos

table(base_modif$recursos)
## 
##  1  2  3  4  5 
##  1 38 39 38 37

Ponemos etiquetas

base_modif <- base_modif %>%
  mutate(recursos=factor(recursos, levels = c("1", "2", "3", "4", "5"), 
                           labels =c("Muy Bajo", "Bajo", "Medio", "Alto", "Muy alto")))

Estimamos el modelo

modelo7 <- lm(life_expectancy~schooling+recursos, data=base_modif)
summary(modelo7)
## 
## Call:
## lm(formula = life_expectancy ~ schooling + recursos, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5353  -2.1449   0.1014   2.6013   8.0857 
## 
## Coefficients:
##                  Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)       48.5397     4.0804  11.896 < 0.0000000000000002 ***
## schooling          0.5578     0.2186   2.552             0.011744 *  
## recursosBajo       7.6966     3.8597   1.994             0.047988 *  
## recursosMedio     14.7907     3.9871   3.710             0.000294 ***
## recursosAlto      18.8403     4.1272   4.565          0.000010476 ***
## recursosMuy alto  23.2019     4.3406   5.345          0.000000337 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.774 on 147 degrees of freedom
## Multiple R-squared:  0.795,  Adjusted R-squared:  0.788 
## F-statistic:   114 on 5 and 147 DF,  p-value: < 0.00000000000000022

Betas estandarizadas

lmbeta <- lm.beta(modelo7)
summary(lmbeta)
## 
## Call:
## lm(formula = life_expectancy ~ schooling + recursos, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.5353  -2.1449   0.1014   2.6013   8.0857 
## 
## Coefficients:
##                  Estimate Standardized Std. Error t value             Pr(>|t|)
## (Intercept)       48.5397           NA     4.0804  11.896 < 0.0000000000000002
## schooling          0.5578       0.2003     0.2186   2.552             0.011744
## recursosBajo       7.6966       0.4071     3.8597   1.994             0.047988
## recursosMedio     14.7907       0.7890     3.9871   3.710             0.000294
## recursosAlto      18.8403       0.9964     4.1272   4.565          0.000010476
## recursosMuy alto  23.2019       1.2161     4.3406   5.345          0.000000337
##                     
## (Intercept)      ***
## schooling        *  
## recursosBajo     *  
## recursosMedio    ***
## recursosAlto     ***
## recursosMuy alto ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.774 on 147 degrees of freedom
## Multiple R-squared:  0.795,  Adjusted R-squared:  0.788 
## F-statistic:   114 on 5 and 147 DF,  p-value: < 0.00000000000000022
confint(lmbeta)
##                      2.5 %    97.5 %
## (Intercept)             NA        NA
## schooling        -0.231683 0.6323254
## recursosBajo     -7.220521 8.0346458
## recursosMedio    -7.090391 8.6684482
## recursosAlto     -7.159802 9.1526656
## recursosMuy alto -7.361937 9.7941562

Interacciones

Hacemos una dummy de recursos=1 si es arriba de la media y 0 si es menor.

base_modif <- base_modif %>% 
mutate(recursos_2 = ifelse(income_composition_of_resources <= 0.6969,1,0))

Estimamos el modelo

modelo8 <- lm(life_expectancy~schooling*recursos_2, data=base_modif)
summary(modelo8)
## 
## Call:
## lm(formula = life_expectancy ~ schooling * recursos_2, data = base_modif)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13.4507  -2.4925   0.4447   2.9703   9.0526 
## 
## Coefficients:
##                      Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)          54.97047    3.95992  13.882 < 0.0000000000000002 ***
## schooling             1.48171    0.26159   5.664         0.0000000738 ***
## recursos_2           -6.40924    5.01719  -1.277                0.203    
## schooling:recursos_2  0.03491    0.39334   0.089                0.929    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.328 on 149 degrees of freedom
## Multiple R-squared:  0.7267, Adjusted R-squared:  0.7212 
## F-statistic:   132 on 3 and 149 DF,  p-value: < 0.00000000000000022

Graficamos

plot_model(modelo8, type = "pred", terms = c("schooling", "recursos_2"))