1. Čo je multikolinearita

Multikolinearita znamená, že niektoré vysvetľujúce premenné v regresii sú medzi sebou silne (takmer lineárne) závislé. Odhady koeficientov metódou najmenších štvorcov zostávajú neskreslené, ale:


2. Dáta a východiskový model

Budeme pracovať s dátami o spotrebe elektriny a počasí. Databáza obsahuje 10-minútové merania a tri zóny spotreby.

Budeme odhadovať model:

\[ PowerConsumption\_Zone1_t = \beta_0 + \beta_1 Temperature_t + \beta_2 Humidity_t + \beta_3 WindSpeed_t + \beta_4 GeneralDiffuseFlows_t + \beta_5 DiffuseFlows_t + \beta_6 PowerConsumption\_Zone2_t + u_t \]

Premennú PowerConsumption_Zone2 pridávame zámerne – je silno korelovaná so Zone1, takže vie vytvoriť výraznú multikolinearitu.

Predpoklad: súbor powerconsumption.csv je v rovnakom priečinku ako tento .Rmd. Ak nie, uprav cestu v read.csv().

udaje <- read.csv("powerconsumption.csv", header = TRUE)

# Datetime si pre istotu konvertujeme na časový typ (nie je nutné pre regresiu, ale hodí sa pri kontrole)
udaje$Datetime <- as.POSIXct(udaje$Datetime, format = "%m/%d/%Y %H:%M")

str(udaje)
## 'data.frame':    52416 obs. of  9 variables:
##  $ Datetime              : POSIXct, format: "2017-01-01 00:00:00" "2017-01-01 00:10:00" ...
##  $ Temperature           : num  6.56 6.41 6.31 6.12 5.92 ...
##  $ Humidity              : num  73.8 74.5 74.5 75 75.7 76.9 77.7 78.2 78.1 77.3 ...
##  $ WindSpeed             : num  0.083 0.083 0.08 0.083 0.081 0.081 0.08 0.085 0.081 0.082 ...
##  $ GeneralDiffuseFlows   : num  0.051 0.07 0.062 0.091 0.048 0.059 0.048 0.055 0.066 0.062 ...
##  $ DiffuseFlows          : num  0.119 0.085 0.1 0.096 0.085 0.108 0.096 0.093 0.141 0.111 ...
##  $ PowerConsumption_Zone1: num  34056 29815 29128 28229 27336 ...
##  $ PowerConsumption_Zone2: num  16129 19375 19007 18361 17872 ...
##  $ PowerConsumption_Zone3: num  20241 20131 19668 18899 18442 ...
summary(udaje)
##     Datetime                    Temperature        Humidity       WindSpeed    
##  Min.   :2017-01-01 00:00:00   Min.   : 3.247   Min.   :11.34   Min.   :0.050  
##  1st Qu.:2017-04-01 23:57:30   1st Qu.:14.410   1st Qu.:58.31   1st Qu.:0.078  
##  Median :2017-07-01 23:55:00   Median :18.780   Median :69.86   Median :0.086  
##  Mean   :2017-07-01 23:55:00   Mean   :18.810   Mean   :68.26   Mean   :1.959  
##  3rd Qu.:2017-09-30 23:52:30   3rd Qu.:22.890   3rd Qu.:81.40   3rd Qu.:4.915  
##  Max.   :2017-12-30 23:50:00   Max.   :40.010   Max.   :94.80   Max.   :6.483  
##  GeneralDiffuseFlows  DiffuseFlows     PowerConsumption_Zone1
##  Min.   :   0.004    Min.   :  0.011   Min.   :13896         
##  1st Qu.:   0.062    1st Qu.:  0.122   1st Qu.:26311         
##  Median :   5.035    Median :  4.456   Median :32266         
##  Mean   : 182.697    Mean   : 75.028   Mean   :32345         
##  3rd Qu.: 319.600    3rd Qu.:101.000   3rd Qu.:37309         
##  Max.   :1163.000    Max.   :936.000   Max.   :52204         
##  PowerConsumption_Zone2 PowerConsumption_Zone3
##  Min.   : 8560          Min.   : 5935         
##  1st Qu.:16981          1st Qu.:13129         
##  Median :20823          Median :16415         
##  Mean   :21043          Mean   :17835         
##  3rd Qu.:24714          3rd Qu.:21624         
##  Max.   :37409          Max.   :47598

3. Odhad základného regresného modelu

model <- lm(
  PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
    GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2,
  data = udaje
)
summary(model)
## 
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity + 
##     WindSpeed + GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2, 
##     data = udaje)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17249.8  -2493.9     71.9   2444.0  15359.2 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.803e+03  1.455e+02  39.875  < 2e-16 ***
## Temperature             1.926e+02  4.013e+00  48.002  < 2e-16 ***
## Humidity                6.218e+00  1.295e+00   4.801 1.58e-06 ***
## WindSpeed              -5.599e+01  8.156e+00  -6.865 6.72e-12 ***
## GeneralDiffuseFlows    -3.408e-01  8.797e-02  -3.875 0.000107 ***
## DiffuseFlows            1.439e+00  1.634e-01   8.811  < 2e-16 ***
## PowerConsumption_Zone2  1.072e+00  3.514e-03 305.066  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3811 on 52409 degrees of freedom
## Multiple R-squared:  0.7144, Adjusted R-squared:  0.7144 
## F-statistic: 2.185e+04 on 6 and 52409 DF,  p-value: < 2.2e-16

4. Korelačná matica a scatterploty

Korelácie zachytávajú párové vzťahy. Pri multikolinearite nás zaujíma hlavne korelácia medzi vysvetľujúcimi premennými.

xvars <- udaje[, c("Temperature", "Humidity", "WindSpeed",
                   "GeneralDiffuseFlows", "DiffuseFlows",
                   "PowerConsumption_Zone2")]

round(cor(xvars), 3)
##                        Temperature Humidity WindSpeed GeneralDiffuseFlows
## Temperature                  1.000   -0.460     0.477               0.460
## Humidity                    -0.460    1.000    -0.136              -0.468
## WindSpeed                    0.477   -0.136     1.000               0.134
## GeneralDiffuseFlows          0.460   -0.468     0.134               1.000
## DiffuseFlows                 0.197   -0.257    -0.001               0.565
## PowerConsumption_Zone2       0.382   -0.295     0.146               0.157
##                        DiffuseFlows PowerConsumption_Zone2
## Temperature                   0.197                  0.382
## Humidity                     -0.257                 -0.295
## WindSpeed                    -0.001                  0.146
## GeneralDiffuseFlows           0.565                  0.157
## DiffuseFlows                  1.000                  0.045
## PowerConsumption_Zone2        0.045                  1.000
pairs(xvars, main = "Scatterplotová matica – vysvetľujúce premenné")


5. VIF (Variance Inflation Factor)

VIF pre premennú \(x_j\):

\[ VIF_j = \frac{1}{1 - R_j^2} \]

Pravidlá (orientačne): - VIF > 5: problém môže byť relevantný, - VIF > 10: silná multikolinearita.

library(car)
vif(model)
##            Temperature               Humidity              WindSpeed 
##               1.965281               1.464191               1.324643 
##    GeneralDiffuseFlows           DiffuseFlows PowerConsumption_Zone2 
##               1.952539               1.486298               1.205864

6. Condition Number (podmienenosť)

Condition number je indikátor, ktorý vie zachytiť aj “cyklickú” multikolinearitu medzi viacerými premennými naraz. Pravidlo (orientačne): - < 10 nízka, - 10–30 mierna, - 30–100 silná, - > 100 veľmi vážna.

Dôležité: Condition number je citlivý na mierku premenných (jednotky), preto pri porovnávaní často pomáha škálovanie.

X <- model.matrix(model)[, -1]   # bez interceptu
XtX <- t(X) %*% X
eig <- eigen(XtX)

condition_number <- sqrt(max(eig$values) / min(eig$values))
condition_number
## [1] 10756.87

7. Riešenia multikolinearity

7.1 Vynechanie problematickej premennej

Najčastejší “čistý” zásah je vynechať regresor, ktorý spôsobuje multikolinearitu. Tu skúsime vynechať PowerConsumption_Zone2 a porovnať výsledok.

model_noZone2 <- lm(
  PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
    GeneralDiffuseFlows + DiffuseFlows,
  data = udaje
)
summary(model_noZone2)
## 
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity + 
##     WindSpeed + GeneralDiffuseFlows + DiffuseFlows, data = udaje)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15613.5  -4933.0   -653.3   4113.8  17982.7 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          2.675e+04  2.138e+02 125.132   <2e-16 ***
## Temperature          5.349e+02  6.418e+00  83.331   <2e-16 ***
## Humidity            -5.651e+01  2.130e+00 -26.525   <2e-16 ***
## WindSpeed           -1.487e+02  1.358e+01 -10.951   <2e-16 ***
## GeneralDiffuseFlows -1.702e+00  1.464e-01 -11.628   <2e-16 ***
## DiffuseFlows        -8.731e-02  2.721e-01  -0.321    0.748    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6349 on 52410 degrees of freedom
## Multiple R-squared:  0.2073, Adjusted R-squared:  0.2072 
## F-statistic:  2741 on 5 and 52410 DF,  p-value: < 2.2e-16
vif(model_noZone2)
##         Temperature            Humidity           WindSpeed GeneralDiffuseFlows 
##            1.811650            1.427287            1.322804            1.947516 
##        DiffuseFlows 
##            1.484903

7.2 Škálovanie (štandardizácia) vysvetľujúcich premenných

Škálovanie nemení “informatívnosť” dát, ale výrazne pomôže pri numerickej stabilite a pri Condition number. Koeficienty sa potom interpretujú v štandardizovaných jednotkách (menej intuitívne v pôvodných jednotkách).

udaje_scaled <- udaje

cols_to_scale <- c("Temperature", "Humidity", "WindSpeed",
                   "GeneralDiffuseFlows", "DiffuseFlows",
                   "PowerConsumption_Zone2")

udaje_scaled[cols_to_scale] <- scale(udaje_scaled[cols_to_scale], center = TRUE, scale = TRUE)

model_scaled <- lm(
  PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
    GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2,
  data = udaje_scaled
)

summary(model_scaled)
## 
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity + 
##     WindSpeed + GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2, 
##     data = udaje_scaled)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17249.8  -2493.9     71.9   2444.0  15359.2 
## 
## Coefficients:
##                        Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)            32344.97      16.65 1943.198  < 2e-16 ***
## Temperature             1120.13      23.33   48.002  < 2e-16 ***
## Humidity                  96.70      20.14    4.801 1.58e-06 ***
## WindSpeed               -131.52      19.16   -6.865 6.72e-12 ***
## GeneralDiffuseFlows      -90.12      23.26   -3.875 0.000107 ***
## DiffuseFlows             178.80      20.29    8.811  < 2e-16 ***
## PowerConsumption_Zone2  5576.18      18.28  305.066  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3811 on 52409 degrees of freedom
## Multiple R-squared:  0.7144, Adjusted R-squared:  0.7144 
## F-statistic: 2.185e+04 on 6 and 52409 DF,  p-value: < 2.2e-16
vif(model_scaled)
##            Temperature               Humidity              WindSpeed 
##               1.965281               1.464191               1.324643 
##    GeneralDiffuseFlows           DiffuseFlows PowerConsumption_Zone2 
##               1.952539               1.486298               1.205864
X <- model.matrix(model_scaled)[, -1]
XtX <- t(X) %*% X
eig <- eigen(XtX)
condition_number_scaled <- sqrt(max(eig$values) / min(eig$values))
condition_number_scaled
## [1] 2.755976

7.3 Jednoduchá zmena jednotiek (zachová interpretáciu)

Ak je problém hlavne v mierke (jednotkách), dá sa prepočítať len konkrétna premenná. Tu napríklad prepočítame spotrebu v Zone2 na “tisíce jednotiek” (len príklad – reálne jednotky závisia od datasetu).

udaje_unit <- udaje
udaje_unit$Zone2_1000 <- udaje_unit$PowerConsumption_Zone2 / 1000

model_unit <- lm(
  PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
    GeneralDiffuseFlows + DiffuseFlows + Zone2_1000,
  data = udaje_unit
)

summary(model_unit)
## 
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity + 
##     WindSpeed + GeneralDiffuseFlows + DiffuseFlows + Zone2_1000, 
##     data = udaje_unit)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17249.8  -2493.9     71.9   2444.0  15359.2 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         5803.08166  145.53221  39.875  < 2e-16 ***
## Temperature          192.61151    4.01255  48.002  < 2e-16 ***
## Humidity               6.21787    1.29518   4.801 1.58e-06 ***
## WindSpeed            -55.99210    8.15616  -6.865 6.72e-12 ***
## GeneralDiffuseFlows   -0.34084    0.08797  -3.875 0.000107 ***
## DiffuseFlows           1.43949    0.16338   8.811  < 2e-16 ***
## Zone2_1000          1072.04024    3.51412 305.066  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3811 on 52409 degrees of freedom
## Multiple R-squared:  0.7144, Adjusted R-squared:  0.7144 
## F-statistic: 2.185e+04 on 6 and 52409 DF,  p-value: < 2.2e-16
vif(model_unit)
##         Temperature            Humidity           WindSpeed GeneralDiffuseFlows 
##            1.965281            1.464191            1.324643            1.952539 
##        DiffuseFlows          Zone2_1000 
##            1.486298            1.205864
X <- model.matrix(model_unit)[, -1]
XtX <- t(X) %*% X
eig <- eigen(XtX)
condition_number_unit <- sqrt(max(eig$values) / min(eig$values))
condition_number_unit
## [1] 169.4135

7.4 PCA (voliteľné)

PCA vytvorí komponent (kombináciu) z korelovaných premenných a tým vie znížiť multikolinearitu. Nižšie spravíme PCA z dvojice GeneralDiffuseFlows a DiffuseFlows a použijeme prvú hlavnú komponentu.

X_pca <- scale(udaje[, c("GeneralDiffuseFlows", "DiffuseFlows")], center = TRUE, scale = TRUE)
pca_res <- prcomp(X_pca)

summary(pca_res)
## Importance of components:
##                           PC1    PC2
## Standard deviation     1.2509 0.6598
## Proportion of Variance 0.7824 0.2176
## Cumulative Proportion  0.7824 1.0000
udaje$PC1_flows <- pca_res$x[, 1]

model_pca <- lm(
  PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed + PC1_flows + PowerConsumption_Zone2,
  data = udaje
)

summary(model_pca)
## 
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity + 
##     WindSpeed + PC1_flows + PowerConsumption_Zone2, data = udaje)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17417.7  -2482.1     65.5   2439.3  15346.9 
## 
## Coefficients:
##                          Estimate Std. Error t value Pr(>|t|)    
## (Intercept)             5.853e+03  1.436e+02  40.765  < 2e-16 ***
## Temperature             1.869e+02  3.932e+00  47.527  < 2e-16 ***
## Humidity                7.646e+00  1.280e+00   5.973 2.34e-09 ***
## WindSpeed              -5.643e+01  8.160e+00  -6.916 4.70e-12 ***
## PC1_flows               7.979e+01  1.511e+01   5.282 1.28e-07 ***
## PowerConsumption_Zone2  1.072e+00  3.515e-03 305.060  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3813 on 52410 degrees of freedom
## Multiple R-squared:  0.7141, Adjusted R-squared:  0.7141 
## F-statistic: 2.619e+04 on 5 and 52410 DF,  p-value: < 2.2e-16
vif(model_pca)
##            Temperature               Humidity              WindSpeed 
##               1.885141               1.428806               1.324566 
##              PC1_flows PowerConsumption_Zone2 
##               1.287453               1.205603

7.5 Ridge Regression (voliteľné)

Ridge regresia stabilizuje odhady pomocou penalizácie. Koeficienty sú potom skreslené (biased), ale často s nižšou variabilitou.

library(MASS)

ridge_df <- udaje[, c("PowerConsumption_Zone1",
                      "Temperature", "Humidity", "WindSpeed",
                      "GeneralDiffuseFlows", "DiffuseFlows",
                      "PowerConsumption_Zone2")]

# Ridge sa správa lepšie na škálovaných premenných
ridge_df[, -1] <- scale(ridge_df[, -1], center = TRUE, scale = TRUE)

ridge_fit <- lm.ridge(
  PowerConsumption_Zone1 ~ .,
  data = ridge_df,
  lambda = seq(0, 50, 5)
)

ridge_fit
##             Temperature Humidity WindSpeed GeneralDiffuseFlows DiffuseFlows
##  0 32344.97    1120.128 96.69522 -131.5177           -90.11784     178.8008
##  5 32344.97    1120.127 96.52650 -131.4471           -90.06920     178.7398
## 10 32344.97    1120.127 96.35786 -131.3765           -90.02056     178.6787
## 15 32344.97    1120.126 96.18928 -131.3060           -89.97193     178.6177
## 20 32344.97    1120.126 96.02077 -131.2354           -89.92331     178.5568
## 25 32344.97    1120.126 95.85233 -131.1649           -89.87469     178.4958
## 30 32344.97    1120.125 95.68396 -131.0944           -89.82608     178.4349
## 35 32344.97    1120.125 95.51565 -131.0239           -89.77747     178.3740
## 40 32344.97    1120.124 95.34742 -130.9534           -89.72887     178.3132
## 45 32344.97    1120.124 95.17925 -130.8829           -89.68028     178.2523
## 50 32344.97    1120.123 95.01115 -130.8124           -89.63169     178.1915
##    PowerConsumption_Zone2
##  0               5576.181
##  5               5575.584
## 10               5574.987
## 15               5574.391
## 20               5573.795
## 25               5573.198
## 30               5572.602
## 35               5572.007
## 40               5571.411
## 45               5570.815
## 50               5570.220
plot(ridge_fit)


8. Zhrnutie