U2A3

Team 2. Cielo Aholiva Higuera Gutiérrez, Mariana Pompa Rivera, Saul López López y Cristina Arguelles Lema

26/04/2021

setwd("~/6to semestre/Estadistica aplicada/U2/U2A3")
library(pacman)
p_load("MASS", "class", "ggplot2")

REGRESION LINEAL MULTIPLE

La regresión lineal múltiple permite generar un modelo lineal en el que el valor de la variable dependiente o respuesta (Y) se determina a partir de un conjunto de variables independientes llamadas predictores (X1, X2, X3…). Es una extensión de la regresión lineal simple, por lo que es fundamental comprender esta última. Los modelos de regresión múltiple pueden emplearse para predecir el valor de la variable dependiente o para evaluar la influencia que tienen los predictores sobre ella (esto último se debe que analizar con cautela para no malinterpretar.

Los modelos lineales múltiples siguen la siguiente ecuación:

\[ Y_{i}=(\beta_{0}+\beta_{1}X_{1i}+\beta_{2}X_{2i}+\cdots+\beta_{n}X_{ni})+e_{i} \]

β0 es la ordenada en el origen, el valor de la variable dependiente Y cuando todos los predictores son cero.
βi : es el efecto promedio que tiene el incremento en una unidad de la variable predictora Xi sobre la variable dependiente Y , manteniéndose constantes el resto de variables. Se conocen como coeficientes parciales de regresión.
ei: es el residuo o error, la diferencia entre el valor observado y el estimado por el modelo.

Condiciones para la regresión lineal multiples

En sí, mismas condiciones que los modelos lineales simples,más algunas adicionales

Regresión multiple : Calidad del aire.

Para mayor estudio de los temas se analizaron diferentes variables de concentración dispersos en el aire, como lo son SO2, NO2, O3 y PM10, de igual manera, para determinar si existe relación con la movilidad en Mexico, estos están representados por las variables de Tiendas y ocio, supermercados, parques, estaciones de transporte, lugares de trabajo y zonas residenciales. Estos datos fueron obtenidos desde el periodo de Marzo - Junio en 2020.

Descarga de este código Para fines de reproducibilidad inmediata se incluye todo el código para su descarga

xfun::embed_file("U2A31.Rmd")

Download U2A31.Rmd

xfun::embed_file("Aire.xlsx")

Download Aire.xlsx

library(readxl)
library(DT)
aire <- read_excel("Aire.xlsx")
datatable(aire)

Relación de variables

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:MASS':
## 
##     select

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

datos <- as.data.frame(aire)
round(cor(x = datos, method = "pearson"), 3)

##                            SO2    NO2   PM10     O3 TiendasyOcio
## SO2                      1.000  0.918  0.241  0.557       -0.408
## NO2                      0.918  1.000  0.170  0.522       -0.472
## PM10                     0.241  0.170  1.000  0.198       -0.249
## O3                       0.557  0.522  0.198  1.000       -0.639
## TiendasyOcio            -0.408 -0.472 -0.249 -0.639        1.000
## SupermercadosyFarmacias -0.431 -0.476 -0.126 -0.611        0.920
## parques                 -0.381 -0.451 -0.151 -0.603        0.965
## EstacionesdeTransporte  -0.453 -0.515 -0.267 -0.661        0.971
## LugaresdeTrabajo        -0.420 -0.500 -0.178 -0.498        0.731
## ZonasResidenciales       0.431  0.512  0.255  0.579       -0.879
##                         SupermercadosyFarmacias parques EstacionesdeTransporte
## SO2                                      -0.431  -0.381                 -0.453
## NO2                                      -0.476  -0.451                 -0.515
## PM10                                     -0.126  -0.151                 -0.267
## O3                                       -0.611  -0.603                 -0.661
## TiendasyOcio                              0.920   0.965                  0.971
## SupermercadosyFarmacias                   1.000   0.907                  0.912
## parques                                   0.907   1.000                  0.935
## EstacionesdeTransporte                    0.912   0.935                  1.000
## LugaresdeTrabajo                          0.669   0.602                  0.694
## ZonasResidenciales                       -0.798  -0.784                 -0.853
##                         LugaresdeTrabajo ZonasResidenciales
## SO2                               -0.420              0.431
## NO2                               -0.500              0.512
## PM10                              -0.178              0.255
## O3                                -0.498              0.579
## TiendasyOcio                       0.731             -0.879
## SupermercadosyFarmacias            0.669             -0.798
## parques                            0.602             -0.784
## EstacionesdeTransporte             0.694             -0.853
## LugaresdeTrabajo                   1.000             -0.932
## ZonasResidenciales                -0.932              1.000

library(psych)

## 
## Attaching package: 'psych'

## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

multi.hist(x = datos, dcol = c("blue", "red"), dlty = c("dotted", "solid"),
           main = "")

library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(datos, lower = list(continuous = "smooth"),
        diag = list(continuous = "barDiag"), axisLabels = "none")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Del análisis preliminar se pueden extraer las siguientes conclusiones:

Las variables de movilidad, como lo es estaciones de transporte tienen una mayor regresión lineal con las concentraciones de O3 (r= -0.661), NO2 (r=-515) y SO2 (r=-0.453).
De igual manera, como era de esperarse, las concentraciones presentan mayor cantidad de regresión lineal entre otras concentraciones, un ejemplo de ello es el NO2 con el que tiene una gran relación con la concentración de SO2 (r= 0.918).
Asi mismo, se puede decir de las variables de movilidad, un ejemplo de ello es la relación que presenta las estaciones de transporte y parque (r=0.935)

Generar el modelo

modelo <- lm(O3 ~ SO2+ NO2+ PM10+ TiendasyOcio + SupermercadosyFarmacias + parques + EstacionesdeTransporte + LugaresdeTrabajo + ZonasResidenciales, data = datos )
summary(modelo)

## 
## Call:
## lm(formula = O3 ~ SO2 + NO2 + PM10 + TiendasyOcio + SupermercadosyFarmacias + 
##     parques + EstacionesdeTransporte + LugaresdeTrabajo + ZonasResidenciales, 
##     data = datos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9914 -2.9020  0.2988  2.9356  9.9302 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              28.282464   2.353313  12.018  < 2e-16 ***
## SO2                      37.554014  11.866639   3.165  0.00211 ** 
## NO2                     -18.729921  12.326842  -1.519  0.13208    
## PM10                     -0.044527   0.060131  -0.740  0.46089    
## TiendasyOcio             -0.086913   0.161546  -0.538  0.59187    
## SupermercadosyFarmacias   0.032385   0.102452   0.316  0.75264    
## parques                   0.051739   0.128708   0.402  0.68863    
## EstacionesdeTransporte   -0.119834   0.084879  -1.412  0.16138    
## LugaresdeTrabajo         -0.001747   0.078946  -0.022  0.98239    
## ZonasResidenciales       -0.019562   0.273108  -0.072  0.94305    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.279 on 92 degrees of freedom
## Multiple R-squared:  0.5364, Adjusted R-squared:  0.491 
## F-statistic: 11.83 on 9 and 92 DF,  p-value: 3.553e-12

El modelo con todas las variables introducidas como predictores tiene un R2 media (0.5364), es capaz de explicar el 53.64% de la variabilidad observada en el ozono (O3). El p-value del modelo es significativo (3.553e-12). Muchos de las variables no son significativos, lo que es un indicativo de que podrían no contribuir al modelo.

Mejores predictores

step(object = modelo, direction = "both", trace = 1)

## Start:  AIC=306.06
## O3 ~ SO2 + NO2 + PM10 + TiendasyOcio + SupermercadosyFarmacias + 
##     parques + EstacionesdeTransporte + LugaresdeTrabajo + ZonasResidenciales
## 
##                           Df Sum of Sq    RSS    AIC
## - LugaresdeTrabajo         1     0.009 1684.9 304.06
## - ZonasResidenciales       1     0.094 1685.0 304.06
## - SupermercadosyFarmacias  1     1.830 1686.7 304.17
## - parques                  1     2.959 1687.8 304.24
## - TiendasyOcio             1     5.301 1690.2 304.38
## - PM10                     1    10.042 1694.9 304.66
## <none>                                 1684.9 306.06
## - EstacionesdeTransporte   1    36.504 1721.4 306.24
## - NO2                      1    42.281 1727.2 306.58
## - SO2                      1   183.415 1868.3 314.60
## 
## Step:  AIC=304.06
## O3 ~ SO2 + NO2 + PM10 + TiendasyOcio + SupermercadosyFarmacias + 
##     parques + EstacionesdeTransporte + ZonasResidenciales
## 
##                           Df Sum of Sq    RSS    AIC
## - ZonasResidenciales       1     0.177 1685.1 302.07
## - SupermercadosyFarmacias  1     1.827 1686.7 302.17
## - parques                  1     3.962 1688.8 302.30
## - TiendasyOcio             1     6.077 1691.0 302.42
## - PM10                     1    11.114 1696.0 302.73
## <none>                                 1684.9 304.06
## - EstacionesdeTransporte   1    39.877 1724.8 304.44
## - NO2                      1    42.902 1727.8 304.62
## + LugaresdeTrabajo         1     0.009 1684.9 306.06
## - SO2                      1   183.413 1868.3 312.60
## 
## Step:  AIC=302.07
## O3 ~ SO2 + NO2 + PM10 + TiendasyOcio + SupermercadosyFarmacias + 
##     parques + EstacionesdeTransporte
## 
##                           Df Sum of Sq    RSS    AIC
## - SupermercadosyFarmacias  1     1.859 1686.9 300.18
## - parques                  1     4.418 1689.5 300.33
## - TiendasyOcio             1     8.260 1693.3 300.57
## - PM10                     1    10.948 1696.0 300.73
## <none>                                 1685.1 302.07
## - EstacionesdeTransporte   1    40.852 1725.9 302.51
## - NO2                      1    47.710 1732.8 302.92
## + ZonasResidenciales       1     0.177 1684.9 304.06
## + LugaresdeTrabajo         1     0.092 1685.0 304.06
## - SO2                      1   189.847 1874.9 310.96
## 
## Step:  AIC=300.18
## O3 ~ SO2 + NO2 + PM10 + TiendasyOcio + parques + EstacionesdeTransporte
## 
##                           Df Sum of Sq    RSS    AIC
## - parques                  1     5.127 1692.0 298.49
## - TiendasyOcio             1     6.988 1693.9 298.60
## - PM10                     1     9.266 1696.2 298.74
## <none>                                 1686.9 300.18
## - EstacionesdeTransporte   1    38.994 1725.9 300.51
## - NO2                      1    46.052 1733.0 300.93
## + SupermercadosyFarmacias  1     1.859 1685.1 302.07
## + ZonasResidenciales       1     0.210 1686.7 302.17
## + LugaresdeTrabajo         1     0.182 1686.7 302.17
## - SO2                      1   189.599 1876.5 309.04
## 
## Step:  AIC=298.49
## O3 ~ SO2 + NO2 + PM10 + TiendasyOcio + EstacionesdeTransporte
## 
##                           Df Sum of Sq    RSS    AIC
## - TiendasyOcio             1     2.225 1694.3 296.62
## - PM10                     1     5.849 1697.9 296.84
## <none>                                 1692.0 298.49
## - EstacionesdeTransporte   1    39.229 1731.3 298.83
## - NO2                      1    46.465 1738.5 299.25
## + parques                  1     5.127 1686.9 300.18
## + SupermercadosyFarmacias  1     2.568 1689.5 300.33
## + LugaresdeTrabajo         1     1.374 1690.7 300.41
## + ZonasResidenciales       1     0.730 1691.3 300.44
## - SO2                      1   190.308 1882.3 307.36
## 
## Step:  AIC=296.62
## O3 ~ SO2 + NO2 + PM10 + EstacionesdeTransporte
## 
##                           Df Sum of Sq    RSS    AIC
## - PM10                     1      6.10 1700.4 294.99
## <none>                                 1694.3 296.62
## - NO2                      1     47.11 1741.4 297.42
## + LugaresdeTrabajo         1      2.70 1691.6 298.46
## + TiendasyOcio             1      2.23 1692.0 298.49
## + ZonasResidenciales       1      2.02 1692.2 298.50
## + SupermercadosyFarmacias  1      0.85 1693.4 298.57
## + parques                  1      0.36 1693.9 298.60
## - SO2                      1    189.08 1883.3 305.41
## - EstacionesdeTransporte   1    791.96 2486.2 333.74
## 
## Step:  AIC=294.99
## O3 ~ SO2 + NO2 + EstacionesdeTransporte
## 
##                           Df Sum of Sq    RSS    AIC
## <none>                                 1700.4 294.99
## - NO2                      1     42.29 1742.7 295.50
## + PM10                     1      6.10 1694.3 296.62
## + LugaresdeTrabajo         1      2.68 1697.7 296.83
## + TiendasyOcio             1      2.47 1697.9 296.84
## + ZonasResidenciales       1      1.51 1698.8 296.90
## + parques                  1      0.01 1700.3 296.99
## + SupermercadosyFarmacias  1      0.00 1700.4 296.99
## - SO2                      1    183.66 1884.0 303.45
## - EstacionesdeTransporte   1    803.36 2503.7 332.46

## 
## Call:
## lm(formula = O3 ~ SO2 + NO2 + EstacionesdeTransporte, data = datos)
## 
## Coefficients:
##            (Intercept)                     SO2                     NO2  
##                27.2678                 35.3704                -17.3799  
## EstacionesdeTransporte  
##                -0.1386

El mejor modelo resultante del proceso de selección ha sido:

modelo <- (lm(formula = O3 ~ SO2 + NO2 + EstacionesdeTransporte, data = datos))
summary(modelo)

## 
## Call:
## lm(formula = O3 ~ SO2 + NO2 + EstacionesdeTransporte, data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1479  -2.7045   0.4738   2.9803   9.8968 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.26784    1.44371  18.887  < 2e-16 ***
## SO2                     35.37036   10.87137   3.254  0.00156 ** 
## NO2                    -17.37991   11.13182  -1.561  0.12168    
## EstacionesdeTransporte  -0.13858    0.02037  -6.805 8.11e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.165 on 98 degrees of freedom
## Multiple R-squared:  0.5321, Adjusted R-squared:  0.5178 
## F-statistic: 37.15 on 3 and 98 DF,  p-value: 4.015e-16

Es recomendable mostrar el intervalo de confianza para cada uno de los coeficientes parciales de regresión:

confint(lm(formula = O3 ~ SO2 + NO2 + EstacionesdeTransporte, data = datos))

##                              2.5 %     97.5 %
## (Intercept)             24.4028519 30.1328271
## SO2                     13.7964687 56.9442430
## NO2                    -39.4706475  4.7108305
## EstacionesdeTransporte  -0.1789987 -0.0981663

Condiciones para la regresión múltiple lineal

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

plot1 <- ggplot(data = datos, aes(SO2, modelo$residuals)) +
    geom_point() + geom_smooth(color = "firebrick") + geom_hline(yintercept = 0) +
    theme_bw()
plot2 <- ggplot(data = datos, aes(NO2, modelo$residuals)) +
    geom_point() + geom_smooth(color = "firebrick") + geom_hline(yintercept = 0) +
    theme_bw()
plot3 <- ggplot(data = datos, aes(EstacionesdeTransporte, modelo$residuals)) +
    geom_point() + geom_smooth(color = "firebrick") + geom_hline(yintercept = 0) +
    theme_bw()
grid.arrange(plot1, plot2, plot3)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Se cumple la linealidad para todos los predictores

Distribución normal de residuos

qqnorm(modelo$residuals)
qqline(modelo$residuals)

shapiro.test(modelo$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  modelo$residuals
## W = 0.99295, p-value = 0.8782

Tanto el análisis gráfico como es test de hipótesis confirman la normalidad.

ggplot(data = datos, aes(modelo$fitted.values, modelo$residuals)) +
geom_point() +
geom_smooth(color = "firebrick", se = FALSE) +
geom_hline(yintercept = 0) +
theme_bw()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(modelo)

## 
##  studentized Breusch-Pagan test
## 
## data:  modelo
## BP = 8.0785, df = 3, p-value = 0.04442

No hay evidencias de falta de homocedasticidad.

Matriz de correlación entre predictores.

library(corrplot)

## corrplot 0.84 loaded

corrplot(cor(dplyr::select(datos, SO2, NO2,EstacionesdeTransporte)),
         method = "number", tl.col = "black")

Análisis de Inflación de Varianza (VIF):

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(modelo)

##                    SO2                    NO2 EstacionesdeTransporte 
##               6.417452               6.947694               1.366745

No hay predictores que muestren una correlación lineal muy alta ni inflación de varianza.

library(car)
dwt(modelo, alternative = "two.sided")

##  lag Autocorrelation D-W Statistic p-value
##    1       0.4120607      1.094955       0
##  Alternative hypothesis: rho != 0

No hay evidencia de autocorrelación

library(dplyr)
datos$studentized_residual <- rstudent(modelo)
ggplot(data = datos, aes(x = predict(modelo), y = abs(studentized_residual))) +
geom_hline(yintercept = 3, color = "grey", linetype = "dashed") +

geom_point(aes(color = ifelse(abs(studentized_residual) > 3, 'red', 'black'))) +
scale_color_identity() +
labs(title = "Distribución de los residuos studentized",
     x = "predicción modelo") + 
theme_bw() + theme(plot.title = element_text(hjust = 0.5))

which(abs(datos$studentized_residual) > 3)

## [1] 1

summary(influence.measures(modelo))

## Potentially influential observations of
##   lm(formula = O3 ~ SO2 + NO2 + EstacionesdeTransporte, data = datos) :
## 
##    dfb.1_ dfb.SO2 dfb.NO2 dfb.EstT dffit   cov.r   cook.d  hat    
## 1   0.77   2.25_* -1.94_*  0.62     2.42_*  1.13_*  1.34_*  0.37_*
## 6   0.02   0.01    0.00    0.02     0.03    1.12_*  0.00    0.07  
## 7  -0.10  -0.51    0.51   -0.13    -0.60    1.18_*  0.09    0.18_*
## 11 -0.04  -0.24    0.37   -0.21    -0.65_*  0.97    0.10    0.09  
## 12  0.05  -0.13    0.09    0.12     0.25    1.12_*  0.02    0.09  
## 13  0.10  -0.14    0.12    0.17     0.26    1.14_*  0.02    0.10  
## 16  0.03  -0.03    0.03    0.04     0.05    1.14_*  0.00    0.09  
## 27  0.23  -0.91    0.95    0.14     0.99_*  0.95    0.24    0.14_*
## 30  0.10  -0.05    0.10    0.06     0.13    1.17_*  0.00    0.11  
## 91  0.05   0.01    0.02    0.05    -0.07    1.13_*  0.00    0.08  
## 99  0.26   0.05    0.12    0.24    -0.46    0.83_*  0.05    0.03

influencePlot(modelo)

##      StudRes        Hat      CookD
## 1   3.134001 0.37351109 1.34305566
## 7  -1.296184 0.17821420 0.09045942
## 27  2.462209 0.14037587 0.23534192
## 99 -2.543409 0.03129969 0.04949248

Conclusión

El modelo lineal multiple es de

\[ Esperanza de vida: 35.3704SO^2 - 17.3799NO^2 - 0.1386EstacionesdeTransporte \]

es capaz de explicar el 53.64% de la variabilidad observada en el ozono (O3) (R2 =0.5364, R2 - Adjusted = 0.491).El test F muestra que es significativo (p-value: 3.553E-12).