Multiple linear regression

library(rio)
data <- import("datos2016_v3.sav")

The goal is to find the best model for identifying the main predictors that explain the vote for PPK.

Exploratory analysis

names(data)

##  [1] "Dpto"                        "region"                     
##  [3] "Analfabetismo_2014"          "pbi_2014"                   
##  [5] "pbipc_2014"                  "Prom_Años_Estudio_2014"     
##  [7] "Hogares_Benef_Prog_Sociales" "Victimas_CVR"               
##  [9] "log_Victimas_CVR"            "Uso_internet"               
## [11] "Inversión_Mineria_2014"      "Log_Inv_Min_2014"           
## [13] "Transf_Canon_Minero_2014"    "Log_Transf"                 
## [15] "Tasa_Delitos"                "Voto_KF_2016_1"             
## [17] "Voto_VM_2016_1"              "Voto_PPK_2016_1"            
## [19] "Voto_PPK_2016_2"             "Voto_KF_2016_2"             
## [21] "pbipc_grupos"                "pbix1000"                   
## [23] "pbixanalf"                   "costa"                      
## [25] "sierra"                      "selva"                      
## [27] "analfxcosta"                 "analfxsierra"               
## [29] "analfxselva"

1. Simple linear regression

Model 1

Predict the first-round vote for PPK in each region from internet use.

library(ggplot2)
ggplot(data, 
  aes(x = Uso_internet, y = Voto_PPK_2016_1)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

cor.test(data$Voto_PPK_2016_1,data$Uso_internet)

## 
##  Pearson's product-moment correlation
## 
## data:  data$Voto_PPK_2016_1 and data$Uso_internet
## t = 4.0278, df = 23, p-value = 0.0005251
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.3324813 0.8278758
## sample estimates:
##       cor 
## 0.6431228

modelo1 <- lm(Voto_PPK_2016_1 ~ Uso_internet, data = data)  # fit the model; summary() shows the coefficient table with the intercept and slope of the fitted line
summary(modelo1)

## 
## Call:
## lm(formula = Voto_PPK_2016_1 ~ Uso_internet, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.6001 -4.9871 -0.2939  3.2344 13.7498 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.51789    3.23825   1.086 0.288576    
## Uso_internet  0.37312    0.09264   4.028 0.000525 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.649 on 23 degrees of freedom
## Multiple R-squared:  0.4136, Adjusted R-squared:  0.3881 
## F-statistic: 16.22 on 1 and 23 DF,  p-value: 0.0005251

Relevant results: cor = 0.6431228 (p-value = 0.0005251); Uso_internet p-value = 0.000525 (***); Multiple R-squared = 0.4136; fitted line y = 3.51789 + 0.37312x.

The p-value is below 0.05, so we reject the null hypothesis of no correlation and conclude that the two variables are linearly associated. The correlation coefficient shows that the association is positive and moderately strong. The Uso_internet coefficient is also statistically significant, so the variable contributes to the model. R-squared is 0.41, meaning that about 41% of the variability in PPK's vote share is explained by the percentage of internet use. The fitted equation is y = 3.51789 + 0.37312*x. Taken literally, a region with 0% internet use would give PPK about 3.5% of the first-round vote, although this intercept is not significantly different from zero (p = 0.289). Each additional percentage point of internet use is associated with an increase of about 0.37 percentage points in PPK's vote share.
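The interpretation above can be checked by hand. The following is a minimal sketch that only reuses numbers printed in the output, so it does not need the original data file. The helper name `predict_ppk` and the 40% input value are illustrative assumptions, not part of the original analysis.

```r
# Coefficients copied from summary(modelo1)
b0 <- 3.51789   # intercept
b1 <- 0.37312   # slope for Uso_internet

# Hypothetical helper: predicted PPK vote share (%) for a given internet use (%)
predict_ppk <- function(uso_internet) b0 + b1 * uso_internet

pred_0  <- predict_ppk(0)    # equals the intercept
pred_40 <- predict_ppk(40)   # illustrative input value
slope_check <- predict_ppk(41) - predict_ppk(40)  # effect of one extra point

# In a simple regression, R-squared is the squared Pearson correlation
r <- 0.6431228
r_squared <- r^2

print(c(pred_0 = pred_0, pred_40 = pred_40,
        slope = slope_check, r_squared = r_squared))
```

The last value should agree with the Multiple R-squared reported by `summary(modelo1)` up to rounding.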

Model 2

ggplot(data, 
  aes(x = Analfabetismo_2014, y = Voto_PPK_2016_1)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

cor.test(data$Voto_PPK_2016_1,data$Analfabetismo_2014)

## 
##  Pearson's product-moment correlation
## 
## data:  data$Voto_PPK_2016_1 and data$Analfabetismo_2014
## t = -4.5425, df = 23, p-value = 0.0001456
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.8514481 -0.4016906
## sample estimates:
##        cor 
## -0.6876737

modelo2 <- lm(Voto_PPK_2016_1 ~ Analfabetismo_2014, data = data)  # fit the model; summary() shows the coefficient table with the intercept and slope of the fitted line
summary(modelo2)

## 
## Call:
## lm(formula = Voto_PPK_2016_1 ~ Analfabetismo_2014, data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.701  -3.535  -1.225   4.402   9.015 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         24.6354     2.2318  11.038 1.15e-10 ***
## Analfabetismo_2014  -1.1462     0.2523  -4.543 0.000146 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.356 on 23 degrees of freedom
## Multiple R-squared:  0.4729, Adjusted R-squared:   0.45 
## F-statistic: 20.63 on 1 and 23 DF,  p-value: 0.0001456

Relevant results: cor = -0.6876737 (p-value = 0.0001456); intercept = 24.6354 (t value = 11.038); slope for Analfabetismo_2014 = -1.1462 (t value = -4.543, p-value = 0.000146); Multiple R-squared = 0.4729.
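The figures listed above are linked to one another. As a rough sketch using only the printed estimate, standard error and residual degrees of freedom from `summary(modelo2)`, the t value, the F statistic, the correlation and the p-value can all be recovered by hand. Small differences from the printed output are expected because the inputs are rounded.

```r
# Values copied from the coefficient table of summary(modelo2)
estimate  <- -1.1462   # slope for Analfabetismo_2014
std_error <- 0.2523
df_resid  <- 23        # residual degrees of freedom

# t value = estimate / standard error
t_value <- estimate / std_error

# With a single predictor, F equals t squared
f_value <- t_value^2

# The Pearson correlation can be recovered from t and the degrees of freedom
r_value <- t_value / sqrt(t_value^2 + df_resid)

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_value), df = df_resid)

print(c(t = t_value, F = f_value, r = r_value, p = p_value))
```

This is why `cor.test()` and `summary()` report the same p-value for a one-predictor model.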

Model 3

ggplot(data, 
  aes(x = pbipc_2014, y = Voto_PPK_2016_1)) +
  geom_point() + 
  geom_smooth(method = "lm", se = FALSE)

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 1 rows containing non-finite values (stat_smooth).

## Warning: Removed 1 rows containing missing values (geom_point).

cor.test(data$Voto_PPK_2016_1,data$pbipc_2014)

## 
##  Pearson's product-moment correlation
## 
## data:  data$Voto_PPK_2016_1 and data$pbipc_2014
## t = 3.6191, df = 22, p-value = 0.00152
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.2753472 0.8137527
## sample estimates:
##       cor 
## 0.6108899

modelo3 <- lm(Voto_PPK_2016_1 ~ pbipc_2014, data = data)  # fit the model; summary() shows the coefficient table with the intercept and slope of the fitted line
summary(modelo3)

## 
## Call:
## lm(formula = Voto_PPK_2016_1 ~ pbipc_2014, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -7.739 -3.634 -1.237  2.506 11.596 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)   
## (Intercept) 8.0704028  2.2830875   3.535  0.00186 **
## pbipc_2014  0.0004790  0.0001324   3.619  0.00152 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.584 on 22 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3732, Adjusted R-squared:  0.3447 
## F-statistic:  13.1 on 1 and 22 DF,  p-value: 0.00152

Relevant results: cor = 0.6108899 (p-value = 0.00152); intercept = 8.0704028; slope = 0.0004790; pbipc_2014 p-value = 0.00152 (**); Multiple R-squared = 0.3732. One region has a missing pbipc_2014 value, so this model is fitted on 24 observations.
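The slope for pbipc_2014 looks tiny only because the predictor is measured in small units. A quick sketch, assuming the variable is expressed in currency units per person (the source does not state the unit), shows that rescaling the predictor by 1,000 rescales the slope and its standard error together and leaves the t value unchanged.

```r
# Values copied from summary(modelo3)
slope_per_unit <- 0.0004790
se_per_unit    <- 0.0001324

# Expressing the predictor in thousands multiplies slope and SE by 1000
slope_per_1000 <- slope_per_unit * 1000
se_per_1000    <- se_per_unit * 1000

# The t value (and therefore the p-value) does not depend on the unit
t_original <- slope_per_unit / se_per_unit
t_rescaled <- slope_per_1000 / se_per_1000

print(c(slope_per_1000 = slope_per_1000,
        t_original = t_original, t_rescaled = t_rescaled))
```

Read this way, an increase of 1,000 units in pbipc_2014 is associated with roughly 0.48 additional percentage points for PPK.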

2. Model comparison

library(stargazer)

## 
## Please cite as:

##  Hlavac, Marek (2022). stargazer: Well-Formatted Regression and Summary Statistics Tables.

##  R package version 5.2.3. https://CRAN.R-project.org/package=stargazer

stargazer(modelo1,modelo2,modelo3, type ="text")

## 
## ========================================================================================
##                                             Dependent variable:                         
##                     --------------------------------------------------------------------
##                                               Voto_PPK_2016_1                           
##                              (1)                    (2)                    (3)          
## ----------------------------------------------------------------------------------------
## Uso_internet               0.373***                                                     
##                            (0.093)                                                      
##                                                                                         
## Analfabetismo_2014                               -1.146***                              
##                                                   (0.252)                               
##                                                                                         
## pbipc_2014                                                              0.0005***       
##                                                                          (0.0001)       
##                                                                                         
## Constant                    3.518                24.635***               8.070***       
##                            (3.238)                (2.232)                (2.283)        
##                                                                                         
## ----------------------------------------------------------------------------------------
## Observations                  25                     25                     24          
## R2                          0.414                  0.473                  0.373         
## Adjusted R2                 0.388                  0.450                  0.345         
## Residual Std. Error    5.649 (df = 23)        5.356 (df = 23)        5.584 (df = 22)    
## F Statistic         16.223*** (df = 1; 23) 20.635*** (df = 1; 23) 13.098*** (df = 1; 22)
## ========================================================================================
## Note:                                                        *p<0.1; **p<0.05; ***p<0.01
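The Adjusted R2 row in the table can be reproduced from the R2 row. The sketch below applies the usual adjustment formula with the observation counts shown in the table; the function name `adjusted_r2` is a hypothetical helper introduced for illustration.

```r
# Adjusted R-squared for a model with n observations and p predictors
adjusted_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared and n copied from the summary() output of each model
adj <- c(
  modelo1 = adjusted_r2(0.4136, n = 25, p = 1),
  modelo2 = adjusted_r2(0.4729, n = 25, p = 1),
  modelo3 = adjusted_r2(0.3732, n = 24, p = 1)
)

print(round(adj, 3))
```

Note also that the stars in this table follow the thresholds printed in its footer (0.1, 0.05, 0.01), which differ from those used by `summary()` (0.05, 0.01, 0.001). That is why pbipc_2014 carries three stars here but two in `summary(modelo3)`.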

3. Multiple linear regression

Model 4

# Can we plot it?
library(scatterplot3d)
# Assign the result without wrapping the call in parentheses: the plot is still drawn,
# but the returned list of helper functions is no longer printed to the console
a <- scatterplot3d(data[, c('Analfabetismo_2014', 'pbipc_2014', 'Uso_internet')])

modelo4<-lm(Voto_PPK_2016_1~Uso_internet+ Analfabetismo_2014, data)
summary(modelo4)

## 
## Call:
## lm(formula = Voto_PPK_2016_1 ~ Uso_internet + Analfabetismo_2014, 
##     data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.8848 -4.0180 -0.6419  4.2686  8.7581 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)  
## (Intercept)         16.8702     7.5241   2.242   0.0354 *
## Uso_internet         0.1542     0.1427   1.080   0.2917  
## Analfabetismo_2014  -0.7963     0.4100  -1.942   0.0650 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.336 on 22 degrees of freedom
## Multiple R-squared:  0.4994, Adjusted R-squared:  0.4539 
## F-statistic: 10.98 on 2 and 22 DF,  p-value: 0.0004942

Interpretation: (Intercept) = 16.8702; Uso_internet = 0.1542; Analfabetismo_2014 = -0.7963; Adjusted R-squared = 0.4539; model p-value = 0.0004942.
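In model 4 neither slope is significant at the 0.05 level on its own, yet the model as a whole is. As an illustrative sketch based only on the printed R-squared and degrees of freedom, the overall F statistic and its p-value can be recomputed as follows.

```r
# Values copied from summary(modelo4)
r2 <- 0.4994
n  <- 25   # observations
p  <- 2    # predictors

df1 <- p
df2 <- n - p - 1   # 22 residual degrees of freedom

# Overall F statistic written in terms of R-squared
f_value <- (r2 / df1) / ((1 - r2) / df2)

# Upper-tail probability of the F distribution gives the model p-value
p_value <- pf(f_value, df1, df2, lower.tail = FALSE)

print(c(F = f_value, p = p_value))
```

A jointly significant model with individually weak coefficients is a common pattern when predictors overlap in the information they carry. Whether that is what happens here would need to be checked on the data, for example with the correlation between Uso_internet and Analfabetismo_2014.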

Model 5

# Can we plot it?
modelo5<-lm(Voto_PPK_2016_1~Uso_internet+ Analfabetismo_2014 + pbipc_2014, data)
summary(modelo5)

## 
## Call:
## lm(formula = Voto_PPK_2016_1 ~ Uso_internet + Analfabetismo_2014 + 
##     pbipc_2014, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2876  -2.6319  -0.6423   2.4823   8.1417 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        19.2084720  7.2748230   2.640   0.0157 *
## Uso_internet       -0.0634842  0.1681417  -0.378   0.7097  
## Analfabetismo_2014 -0.8535105  0.3884828  -2.197   0.0400 *
## pbipc_2014          0.0003255  0.0001645   1.979   0.0618 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.037 on 20 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5363, Adjusted R-squared:  0.4668 
## F-statistic: 7.711 on 3 and 20 DF,  p-value: 0.001292

Interpretation: what do these relationships tell us? (Intercept) = 19.2084720; Uso_internet = -0.0634842; Analfabetismo_2014 = -0.8535105; pbipc_2014 = 0.0003255; Adjusted R-squared = 0.4668; model p-value = 0.001292.
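One way to read these relationships is to check which coefficients are distinguishable from zero. The sketch below recomputes the two-sided p-values from the printed t values and the 20 residual degrees of freedom, then flags those below 0.05. Because the t values are rounded, the results match the output only approximately.

```r
# t values copied from the coefficient table of summary(modelo5)
t_values <- c(
  Uso_internet       = -0.378,
  Analfabetismo_2014 = -2.197,
  pbipc_2014         =  1.979
)
df_resid <- 20

# Two-sided p-values from the t distribution
p_values <- 2 * pt(-abs(t_values), df = df_resid)

# Flag predictors significant at the 0.05 level
significant <- p_values < 0.05

print(round(p_values, 4))
print(significant)
```

Only Analfabetismo_2014 passes the 0.05 threshold, while pbipc_2014 is significant only at the 0.1 level.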

Model 6

# Can we plot it?
# Note: this formula is the same as in modelo5, so the output below is identical
modelo6<-lm(Voto_PPK_2016_1~Uso_internet+ Analfabetismo_2014 + pbipc_2014, data)
summary(modelo6)

## 
## Call:
## lm(formula = Voto_PPK_2016_1 ~ Uso_internet + Analfabetismo_2014 + 
##     pbipc_2014, data = data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.2876  -2.6319  -0.6423   2.4823   8.1417 
## 
## Coefficients:
##                      Estimate Std. Error t value Pr(>|t|)  
## (Intercept)        19.2084720  7.2748230   2.640   0.0157 *
## Uso_internet       -0.0634842  0.1681417  -0.378   0.7097  
## Analfabetismo_2014 -0.8535105  0.3884828  -2.197   0.0400 *
## pbipc_2014          0.0003255  0.0001645   1.979   0.0618 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.037 on 20 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.5363, Adjusted R-squared:  0.4668 
## F-statistic: 7.711 on 3 and 20 DF,  p-value: 0.001292

Model p-value: 0.001292
Adjusted R-squared: 0.4668

                    Estimate    Pr(>|t|)
(Intercept)         19.2084720  0.0157
Uso_internet        -0.0634842  0.7097
Analfabetismo_2014  -0.8535105  0.0400
pbipc_2014           0.0003255  0.0618

The model's overall p-value (0.001292) is below 0.05, so we reject the null hypothesis that all slopes are zero and conclude that the model as a whole is valid. However, the p-values of the individual variables change noticeably compared with the simple regressions. Only one predictor is significant at the 0.05 level: the illiteracy rate (p = 0.0400). The model explains about 46.7% of the variability in PPK's vote (adjusted R-squared); this figure belongs to the three predictors jointly, not to illiteracy alone. The illiteracy rate remains a significant predictor even after controlling for internet use and GDP per capita in each region.

The fitted equation is Y = 19.208 - 0.0635*Uso_internet - 0.8535*Analfabetismo_2014 + 0.0003255*pbipc_2014. The intercept, about 19%, is the predicted vote for PPK in a region where all three predictors equal 0. Holding internet use and GDP per capita constant, each additional point in the illiteracy rate is associated with a decrease of about 0.85 percentage points in PPK's vote share.

Model 6 has the same specification as model 5, so the two fits are identical. We choose this specification over model 4 because it includes more control variables and has a slightly higher adjusted R-squared (0.467 versus 0.454).
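One caveat when comparing model 4 with models 5 and 6 is that they are fitted on different samples (25 versus 24 regions) because pbipc_2014 has a missing value. A minimal sketch of how nested models could be compared on a common sample is shown below. It uses simulated toy data with hypothetical variables `x1`, `x2`, `x3` and `y`, not the election dataset, so the numbers it prints say nothing about the real models.

```r
set.seed(123)  # reproducible toy data; NOT the election dataset
n <- 25
toy <- data.frame(
  x1 = rnorm(n),
  x2 = rnorm(n),
  x3 = rnorm(n)
)
toy$y <- 10 + 0.5 * toy$x1 - 0.8 * toy$x2 + rnorm(n)
toy$x3[3] <- NA  # mimic the single region with a missing predictor

# Keep only rows that are complete for every variable in the larger model
vars <- c("y", "x1", "x2", "x3")
complete <- toy[complete.cases(toy[, vars]), ]

# Fit both nested models on the same rows
m_small <- lm(y ~ x1 + x2, data = complete)
m_big   <- lm(y ~ x1 + x2 + x3, data = complete)

# Partial F test: does adding x3 improve the fit?
comparison <- anova(m_small, m_big)
print(comparison)
```

Applied to the real data, the same pattern would refit model 4 on the 24 complete regions before comparing it with model 5.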

library(stargazer)
library(jtools)
stargazer(modelo4,modelo5,modelo6, type ="text")

## 
## ======================================================================================
##                                            Dependent variable:                        
##                     ------------------------------------------------------------------
##                                              Voto_PPK_2016_1                          
##                              (1)                    (2)                   (3)         
## --------------------------------------------------------------------------------------
## Uso_internet                0.154                 -0.063                -0.063        
##                            (0.143)                (0.168)               (0.168)       
##                                                                                       
## Analfabetismo_2014         -0.796*               -0.854**              -0.854**       
##                            (0.410)                (0.388)               (0.388)       
##                                                                                       
## pbipc_2014                                        0.0003*               0.0003*       
##                                                  (0.0002)              (0.0002)       
##                                                                                       
## Constant                   16.870**              19.208**              19.208**       
##                            (7.524)                (7.275)               (7.275)       
##                                                                                       
## --------------------------------------------------------------------------------------
## Observations                  25                    24                    24          
## R2                          0.499                  0.536                 0.536        
## Adjusted R2                 0.454                  0.467                 0.467        
## Residual Std. Error    5.336 (df = 22)        5.037 (df = 20)       5.037 (df = 20)   
## F Statistic         10.976*** (df = 2; 22) 7.711*** (df = 3; 20) 7.711*** (df = 3; 20)
## ======================================================================================
## Note:                                                      *p<0.1; **p<0.05; ***p<0.01


2022-06-15
