U2A4

Generar el Modelo
Selección de los mejores predictores
Validación de condiciones para la regresión múltiple lineal
Identificación de posibles valores atípicos o influyentes
Conclusión

La regresión lineal múltiple permite generar un modelo lineal en el que el valor de la variable dependiente o respuesta (Y) se determina a partir de un conjunto de variables independientes llamadas predictores (X1, X2, X3…). Es una extensión de la regresión lineal simple, por lo que es fundamental comprender esta última. Los modelos de regresión múltiple pueden emplearse para predecir el valor de la variable dependiente o para evaluar la influencia que tienen los predictores sobre ella (esto último se debe que analizar con cautela para no malinterpretar causa-efecto).

Los modelos lineales múltiples siguen la siguiente ecuación:

\[ Y_{i}=(\beta_{0}+\beta_{1}X_{1i}+\beta_{2}X_{2i}+\cdots+\beta_{n}X_{ni})+e_{i} \] * β0 es la ordenada en el origen, el valor de la variable dependiente Y cuando todos los predictores son cero.

βi : es el efecto promedio que tiene el incremento en una unidad de la variable predictora Xi sobre la variable dependiente Y , manteniéndose constantes el resto de variables. Se conocen como coeficientes parciales de regresión.
ei: es el residuo o error, la diferencia entre el valor observado y el estimado por el modelo.

library(DT)
library(readxl)

aire <- read_excel("Aire.xlsx")
datatable(aire)

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

datos <- as.data.frame(aire)
round(cor(x = datos, method = "pearson"), 3)

##                            SO2    NO2   PM10     O3 TiendasyOcio
## SO2                      1.000  0.918  0.241  0.557       -0.408
## NO2                      0.918  1.000  0.170  0.522       -0.472
## PM10                     0.241  0.170  1.000  0.198       -0.249
## O3                       0.557  0.522  0.198  1.000       -0.639
## TiendasyOcio            -0.408 -0.472 -0.249 -0.639        1.000
## SupermercadosyFarmacias -0.431 -0.476 -0.126 -0.611        0.920
## parques                 -0.381 -0.451 -0.151 -0.603        0.965
## EstacionesdeTransporte  -0.453 -0.515 -0.267 -0.661        0.971
## LugaresdeTrabajo        -0.420 -0.500 -0.178 -0.498        0.731
## ZonasResidenciales       0.431  0.512  0.255  0.579       -0.879
##                         SupermercadosyFarmacias parques EstacionesdeTransporte
## SO2                                      -0.431  -0.381                 -0.453
## NO2                                      -0.476  -0.451                 -0.515
## PM10                                     -0.126  -0.151                 -0.267
## O3                                       -0.611  -0.603                 -0.661
## TiendasyOcio                              0.920   0.965                  0.971
## SupermercadosyFarmacias                   1.000   0.907                  0.912
## parques                                   0.907   1.000                  0.935
## EstacionesdeTransporte                    0.912   0.935                  1.000
## LugaresdeTrabajo                          0.669   0.602                  0.694
## ZonasResidenciales                       -0.798  -0.784                 -0.853
##                         LugaresdeTrabajo ZonasResidenciales
## SO2                               -0.420              0.431
## NO2                               -0.500              0.512
## PM10                              -0.178              0.255
## O3                                -0.498              0.579
## TiendasyOcio                       0.731             -0.879
## SupermercadosyFarmacias            0.669             -0.798
## parques                            0.602             -0.784
## EstacionesdeTransporte             0.694             -0.853
## LugaresdeTrabajo                   1.000             -0.932
## ZonasResidenciales                -0.932              1.000

library(psych)
multi.hist(x = datos, dcol = c("blue", "red"), dlty = c("dotted", "solid"),
           main = "")

library(GGally)

## Loading required package: ggplot2

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

ggpairs(datos, lower = list(continuous = "smooth"),
        diag = list(continuous = "barDiag"), axisLabels = "none")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Del análisis preliminar se pueden extraer las siguientes conclusiones:

Las variables de movilidad, como lo es estaciones de transporte tienen una mayor regresión lineal con las concentraciones de O3 (r= -0.661), NO2 (r=-515) y SO2 (r=-0.453).
De igual manera, como era de esperarse, las concentraciones presentan mayor cantidad de regresión lineal entre otras concentraciones, un ejemplo de ello es el NO2 con el que tiene una gran relación con la concentración de SO2 (r= 0.918).
Asi mismo, se puede decir de las variables de movilidad, un ejemplo de ello es la relación que presenta las estaciones de transporte y parque (r=0.935)

Generar el Modelo

modelo <- lm(O3 ~ TiendasyOcio + SupermercadosyFarmacias + parques + EstacionesdeTransporte +
               LugaresdeTrabajo + ZonasResidenciales + PM10 + NO2 + SO2, data = datos )
summary(modelo)

## 
## Call:
## lm(formula = O3 ~ TiendasyOcio + SupermercadosyFarmacias + parques + 
##     EstacionesdeTransporte + LugaresdeTrabajo + ZonasResidenciales + 
##     PM10 + NO2 + SO2, data = datos)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.9914 -2.9020  0.2988  2.9356  9.9302 
## 
## Coefficients:
##                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              28.282464   2.353313  12.018  < 2e-16 ***
## TiendasyOcio             -0.086913   0.161546  -0.538  0.59187    
## SupermercadosyFarmacias   0.032385   0.102452   0.316  0.75264    
## parques                   0.051739   0.128708   0.402  0.68863    
## EstacionesdeTransporte   -0.119834   0.084879  -1.412  0.16138    
## LugaresdeTrabajo         -0.001747   0.078946  -0.022  0.98239    
## ZonasResidenciales       -0.019562   0.273108  -0.072  0.94305    
## PM10                     -0.044527   0.060131  -0.740  0.46089    
## NO2                     -18.729921  12.326842  -1.519  0.13208    
## SO2                      37.554014  11.866639   3.165  0.00211 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.279 on 92 degrees of freedom
## Multiple R-squared:  0.5364, Adjusted R-squared:  0.491 
## F-statistic: 11.83 on 9 and 92 DF,  p-value: 3.553e-12

El modelo con todas las variables introducidas como predictores tiene un R2 media (0.5364), es capaz de explicar el 53,64% de la variabilidad observada en el Ozono.

Selección de los mejores predictores

step(object = modelo, direction = "both", trace = 1)

## Start:  AIC=306.06
## O3 ~ TiendasyOcio + SupermercadosyFarmacias + parques + EstacionesdeTransporte + 
##     LugaresdeTrabajo + ZonasResidenciales + PM10 + NO2 + SO2
## 
##                           Df Sum of Sq    RSS    AIC
## - LugaresdeTrabajo         1     0.009 1684.9 304.06
## - ZonasResidenciales       1     0.094 1685.0 304.06
## - SupermercadosyFarmacias  1     1.830 1686.7 304.17
## - parques                  1     2.959 1687.8 304.24
## - TiendasyOcio             1     5.301 1690.2 304.38
## - PM10                     1    10.042 1694.9 304.66
## <none>                                 1684.9 306.06
## - EstacionesdeTransporte   1    36.504 1721.4 306.24
## - NO2                      1    42.281 1727.2 306.58
## - SO2                      1   183.415 1868.3 314.60
## 
## Step:  AIC=304.06
## O3 ~ TiendasyOcio + SupermercadosyFarmacias + parques + EstacionesdeTransporte + 
##     ZonasResidenciales + PM10 + NO2 + SO2
## 
##                           Df Sum of Sq    RSS    AIC
## - ZonasResidenciales       1     0.177 1685.1 302.07
## - SupermercadosyFarmacias  1     1.827 1686.7 302.17
## - parques                  1     3.962 1688.8 302.30
## - TiendasyOcio             1     6.077 1691.0 302.42
## - PM10                     1    11.114 1696.0 302.73
## <none>                                 1684.9 304.06
## - EstacionesdeTransporte   1    39.877 1724.8 304.44
## - NO2                      1    42.902 1727.8 304.62
## + LugaresdeTrabajo         1     0.009 1684.9 306.06
## - SO2                      1   183.413 1868.3 312.60
## 
## Step:  AIC=302.07
## O3 ~ TiendasyOcio + SupermercadosyFarmacias + parques + EstacionesdeTransporte + 
##     PM10 + NO2 + SO2
## 
##                           Df Sum of Sq    RSS    AIC
## - SupermercadosyFarmacias  1     1.859 1686.9 300.18
## - parques                  1     4.418 1689.5 300.33
## - TiendasyOcio             1     8.260 1693.3 300.57
## - PM10                     1    10.948 1696.0 300.73
## <none>                                 1685.1 302.07
## - EstacionesdeTransporte   1    40.852 1725.9 302.51
## - NO2                      1    47.710 1732.8 302.92
## + ZonasResidenciales       1     0.177 1684.9 304.06
## + LugaresdeTrabajo         1     0.092 1685.0 304.06
## - SO2                      1   189.847 1874.9 310.96
## 
## Step:  AIC=300.18
## O3 ~ TiendasyOcio + parques + EstacionesdeTransporte + PM10 + 
##     NO2 + SO2
## 
##                           Df Sum of Sq    RSS    AIC
## - parques                  1     5.127 1692.0 298.49
## - TiendasyOcio             1     6.988 1693.9 298.60
## - PM10                     1     9.266 1696.2 298.74
## <none>                                 1686.9 300.18
## - EstacionesdeTransporte   1    38.994 1725.9 300.51
## - NO2                      1    46.052 1733.0 300.93
## + SupermercadosyFarmacias  1     1.859 1685.1 302.07
## + ZonasResidenciales       1     0.210 1686.7 302.17
## + LugaresdeTrabajo         1     0.182 1686.7 302.17
## - SO2                      1   189.599 1876.5 309.04
## 
## Step:  AIC=298.49
## O3 ~ TiendasyOcio + EstacionesdeTransporte + PM10 + NO2 + SO2
## 
##                           Df Sum of Sq    RSS    AIC
## - TiendasyOcio             1     2.225 1694.3 296.62
## - PM10                     1     5.849 1697.9 296.84
## <none>                                 1692.0 298.49
## - EstacionesdeTransporte   1    39.229 1731.3 298.83
## - NO2                      1    46.465 1738.5 299.25
## + parques                  1     5.127 1686.9 300.18
## + SupermercadosyFarmacias  1     2.568 1689.5 300.33
## + LugaresdeTrabajo         1     1.374 1690.7 300.41
## + ZonasResidenciales       1     0.730 1691.3 300.44
## - SO2                      1   190.308 1882.3 307.36
## 
## Step:  AIC=296.62
## O3 ~ EstacionesdeTransporte + PM10 + NO2 + SO2
## 
##                           Df Sum of Sq    RSS    AIC
## - PM10                     1      6.10 1700.4 294.99
## <none>                                 1694.3 296.62
## - NO2                      1     47.11 1741.4 297.42
## + LugaresdeTrabajo         1      2.70 1691.6 298.46
## + TiendasyOcio             1      2.23 1692.0 298.49
## + ZonasResidenciales       1      2.02 1692.2 298.50
## + SupermercadosyFarmacias  1      0.85 1693.4 298.57
## + parques                  1      0.36 1693.9 298.60
## - SO2                      1    189.08 1883.3 305.41
## - EstacionesdeTransporte   1    791.96 2486.2 333.74
## 
## Step:  AIC=294.99
## O3 ~ EstacionesdeTransporte + NO2 + SO2
## 
##                           Df Sum of Sq    RSS    AIC
## <none>                                 1700.4 294.99
## - NO2                      1     42.29 1742.7 295.50
## + PM10                     1      6.10 1694.3 296.62
## + LugaresdeTrabajo         1      2.68 1697.7 296.83
## + TiendasyOcio             1      2.47 1697.9 296.84
## + ZonasResidenciales       1      1.51 1698.8 296.90
## + parques                  1      0.01 1700.3 296.99
## + SupermercadosyFarmacias  1      0.00 1700.4 296.99
## - SO2                      1    183.66 1884.0 303.45
## - EstacionesdeTransporte   1    803.36 2503.7 332.46

## 
## Call:
## lm(formula = O3 ~ EstacionesdeTransporte + NO2 + SO2, data = datos)
## 
## Coefficients:
##            (Intercept)  EstacionesdeTransporte                     NO2  
##                27.2678                 -0.1386                -17.3799  
##                    SO2  
##                35.3704

El mejor modelo resultante del proceso de selección ha sido:

modelo <- (lm(formula = O3 ~ EstacionesdeTransporte+ NO2+ SO2, data = datos))
summary(modelo)

## 
## Call:
## lm(formula = O3 ~ EstacionesdeTransporte + NO2 + SO2, data = datos)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.1479  -2.7045   0.4738   2.9803   9.8968 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             27.26784    1.44371  18.887  < 2e-16 ***
## EstacionesdeTransporte  -0.13858    0.02037  -6.805 8.11e-10 ***
## NO2                    -17.37991   11.13182  -1.561  0.12168    
## SO2                     35.37036   10.87137   3.254  0.00156 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.165 on 98 degrees of freedom
## Multiple R-squared:  0.5321, Adjusted R-squared:  0.5178 
## F-statistic: 37.15 on 3 and 98 DF,  p-value: 4.015e-16

Es recomendable mostrar el intervalo de confianza para cada uno de los coeficientes parciales de regresión:

confint(lm(formula = O3 ~ EstacionesdeTransporte+ NO2+ SO2, data = datos))

##                              2.5 %     97.5 %
## (Intercept)             24.4028519 30.1328271
## EstacionesdeTransporte  -0.1789987 -0.0981663
## NO2                    -39.4706475  4.7108305
## SO2                     13.7964687 56.9442430

Validación de condiciones para la regresión múltiple lineal

library(ggplot2)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

plot1 <- ggplot(data = datos, aes(NO2, modelo$residuals)) +
    geom_point() + geom_smooth(color = "firebrick") + geom_hline(yintercept = 0) +
    theme_bw()
plot2 <- ggplot(data = datos, aes(SO2, modelo$residuals)) +
    geom_point() + geom_smooth(color = "firebrick") + geom_hline(yintercept = 0) +
    theme_bw()
plot3 <- ggplot(data = datos, aes(EstacionesdeTransporte, modelo$residuals)) +
    geom_point() + geom_smooth(color = "firebrick") + geom_hline(yintercept = 0) +
    theme_bw()
grid.arrange(plot1, plot2, plot3)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Se cumple la linealidad para todos los predictores

Distribución normal de los residuos:

qqnorm(modelo$residuals)
qqline(modelo$residuals)

shapiro.test(modelo$residuals)

## 
##  Shapiro-Wilk normality test
## 
## data:  modelo$residuals
## W = 0.99295, p-value = 0.8782

ggplot(data = datos, aes(modelo$fitted.values, modelo$residuals)) +
geom_point() +
geom_smooth(color = "firebrick", se = FALSE) +
geom_hline(yintercept = 0) +
theme_bw()

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

library(lmtest)

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

bptest(modelo)

## 
##  studentized Breusch-Pagan test
## 
## data:  modelo
## BP = 8.0785, df = 3, p-value = 0.04442

No hay evidencia de homocedasticidad

Matriz de correlación entre predictores.

library(corrplot)

## corrplot 0.84 loaded

corrplot(cor(dplyr::select(datos, NO2, SO2,EstacionesdeTransporte)),
         method = "number", tl.col = "blue")

Análisis de Inflación de Varianza (VIF):

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:psych':
## 
##     logit

## The following object is masked from 'package:dplyr':
## 
##     recode

vif(modelo)

## EstacionesdeTransporte                    NO2                    SO2 
##               1.366745               6.947694               6.417452

Autocorrelación:

library(car)
dwt(modelo, alternative = "two.sided")

##  lag Autocorrelation D-W Statistic p-value
##    1       0.4120607      1.094955       0
##  Alternative hypothesis: rho != 0

Identificación de posibles valores atípicos o influyentes

library(dplyr)
datos$studentized_residual <- rstudent(modelo)
ggplot(data = datos, aes(x = predict(modelo), y = abs(studentized_residual))) +
geom_hline(yintercept = 3, color = "grey", linetype = "dashed") +
# se identifican en rojo observaciones con residuos estandarizados absolutos > 3
geom_point(aes(color = ifelse(abs(studentized_residual) > 3, 'red', 'black'))) +
scale_color_identity() +
labs(title = "Distribución de los residuos studentized",
     x = "predicción modelo") + 
theme_bw() + theme(plot.title = element_text(hjust = 0.5))

which(abs(datos$studentized_residual) > 3)

## [1] 1

summary(influence.measures(modelo))

## Potentially influential observations of
##   lm(formula = O3 ~ EstacionesdeTransporte + NO2 + SO2, data = datos) :
## 
##    dfb.1_ dfb.EstT dfb.NO2 dfb.SO2 dffit   cov.r   cook.d  hat    
## 1   0.77   0.62    -1.94_*  2.25_*  2.42_*  1.13_*  1.34_*  0.37_*
## 6   0.02   0.02     0.00    0.01    0.03    1.12_*  0.00    0.07  
## 7  -0.10  -0.13     0.51   -0.51   -0.60    1.18_*  0.09    0.18_*
## 11 -0.04  -0.21     0.37   -0.24   -0.65_*  0.97    0.10    0.09  
## 12  0.05   0.12     0.09   -0.13    0.25    1.12_*  0.02    0.09  
## 13  0.10   0.17     0.12   -0.14    0.26    1.14_*  0.02    0.10  
## 16  0.03   0.04     0.03   -0.03    0.05    1.14_*  0.00    0.09  
## 27  0.23   0.14     0.95   -0.91    0.99_*  0.95    0.24    0.14_*
## 30  0.10   0.06     0.10   -0.05    0.13    1.17_*  0.00    0.11  
## 91  0.05   0.05     0.02    0.01   -0.07    1.13_*  0.00    0.08  
## 99  0.26   0.24     0.12    0.05   -0.46    0.83_*  0.05    0.03

influencePlot(modelo)

##      StudRes        Hat      CookD
## 1   3.134001 0.37351109 1.34305566
## 7  -1.296184 0.17821420 0.09045942
## 27  2.462209 0.14037587 0.23534192
## 99 -2.543409 0.03129969 0.04949248

Conclusión

El modelo lineal múltiple es de \[O3:35.3704SO2−17.3799NO2−0.1386EstacionesdeTransporte \] es capaz de explicar el 53.64% de la variabilidad observada en la esperanza de vida (R~2: 0.5364, R^2-Adjusted: 0.491). El test F muestra que es significativo (p-value: 3.553e-12).