Multikolinearita znamená, že niektoré vysvetľujúce premenné v regresii sú medzi sebou silne (takmer lineárne) závislé. Odhady koeficientov metódou najmenších štvorcov zostávajú neskreslené, ale:
Budeme pracovať s dátami o spotrebe elektriny a počasí. Databáza obsahuje 10-minútové merania a tri zóny spotreby.
Budeme odhadovať model:
\[ PowerConsumption\_Zone1_t = \beta_0 + \beta_1 Temperature_t + \beta_2 Humidity_t + \beta_3 WindSpeed_t + \beta_4 GeneralDiffuseFlows_t + \beta_5 DiffuseFlows_t + \beta_6 PowerConsumption\_Zone2_t + u_t \]
Premennú PowerConsumption_Zone2 pridávame zámerne – je
silno korelovaná so Zone1, takže vie vytvoriť výraznú
multikolinearitu.
Predpoklad: súbor
powerconsumption.csvje v rovnakom priečinku ako tento.Rmd. Ak nie, uprav cestu vread.csv().
udaje <- read.csv("powerconsumption.csv", header = TRUE)
# Datetime si pre istotu konvertujeme na časový typ (nie je nutné pre regresiu, ale hodí sa pri kontrole)
udaje$Datetime <- as.POSIXct(udaje$Datetime, format = "%m/%d/%Y %H:%M")
str(udaje)
## 'data.frame': 52416 obs. of 9 variables:
## $ Datetime : POSIXct, format: "2017-01-01 00:00:00" "2017-01-01 00:10:00" ...
## $ Temperature : num 6.56 6.41 6.31 6.12 5.92 ...
## $ Humidity : num 73.8 74.5 74.5 75 75.7 76.9 77.7 78.2 78.1 77.3 ...
## $ WindSpeed : num 0.083 0.083 0.08 0.083 0.081 0.081 0.08 0.085 0.081 0.082 ...
## $ GeneralDiffuseFlows : num 0.051 0.07 0.062 0.091 0.048 0.059 0.048 0.055 0.066 0.062 ...
## $ DiffuseFlows : num 0.119 0.085 0.1 0.096 0.085 0.108 0.096 0.093 0.141 0.111 ...
## $ PowerConsumption_Zone1: num 34056 29815 29128 28229 27336 ...
## $ PowerConsumption_Zone2: num 16129 19375 19007 18361 17872 ...
## $ PowerConsumption_Zone3: num 20241 20131 19668 18899 18442 ...
summary(udaje)
## Datetime Temperature Humidity WindSpeed
## Min. :2017-01-01 00:00:00 Min. : 3.247 Min. :11.34 Min. :0.050
## 1st Qu.:2017-04-01 23:57:30 1st Qu.:14.410 1st Qu.:58.31 1st Qu.:0.078
## Median :2017-07-01 23:55:00 Median :18.780 Median :69.86 Median :0.086
## Mean :2017-07-01 23:55:00 Mean :18.810 Mean :68.26 Mean :1.959
## 3rd Qu.:2017-09-30 23:52:30 3rd Qu.:22.890 3rd Qu.:81.40 3rd Qu.:4.915
## Max. :2017-12-30 23:50:00 Max. :40.010 Max. :94.80 Max. :6.483
## GeneralDiffuseFlows DiffuseFlows PowerConsumption_Zone1
## Min. : 0.004 Min. : 0.011 Min. :13896
## 1st Qu.: 0.062 1st Qu.: 0.122 1st Qu.:26311
## Median : 5.035 Median : 4.456 Median :32266
## Mean : 182.697 Mean : 75.028 Mean :32345
## 3rd Qu.: 319.600 3rd Qu.:101.000 3rd Qu.:37309
## Max. :1163.000 Max. :936.000 Max. :52204
## PowerConsumption_Zone2 PowerConsumption_Zone3
## Min. : 8560 Min. : 5935
## 1st Qu.:16981 1st Qu.:13129
## Median :20823 Median :16415
## Mean :21043 Mean :17835
## 3rd Qu.:24714 3rd Qu.:21624
## Max. :37409 Max. :47598
model <- lm(
PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2,
data = udaje
)
summary(model)
##
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity +
## WindSpeed + GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2,
## data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17249.8 -2493.9 71.9 2444.0 15359.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.803e+03 1.455e+02 39.875 < 2e-16 ***
## Temperature 1.926e+02 4.013e+00 48.002 < 2e-16 ***
## Humidity 6.218e+00 1.295e+00 4.801 1.58e-06 ***
## WindSpeed -5.599e+01 8.156e+00 -6.865 6.72e-12 ***
## GeneralDiffuseFlows -3.408e-01 8.797e-02 -3.875 0.000107 ***
## DiffuseFlows 1.439e+00 1.634e-01 8.811 < 2e-16 ***
## PowerConsumption_Zone2 1.072e+00 3.514e-03 305.066 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3811 on 52409 degrees of freedom
## Multiple R-squared: 0.7144, Adjusted R-squared: 0.7144
## F-statistic: 2.185e+04 on 6 and 52409 DF, p-value: < 2.2e-16
Korelácie zachytávajú párové vzťahy. Pri multikolinearite nás zaujíma hlavne korelácia medzi vysvetľujúcimi premennými.
xvars <- udaje[, c("Temperature", "Humidity", "WindSpeed",
"GeneralDiffuseFlows", "DiffuseFlows",
"PowerConsumption_Zone2")]
round(cor(xvars), 3)
## Temperature Humidity WindSpeed GeneralDiffuseFlows
## Temperature 1.000 -0.460 0.477 0.460
## Humidity -0.460 1.000 -0.136 -0.468
## WindSpeed 0.477 -0.136 1.000 0.134
## GeneralDiffuseFlows 0.460 -0.468 0.134 1.000
## DiffuseFlows 0.197 -0.257 -0.001 0.565
## PowerConsumption_Zone2 0.382 -0.295 0.146 0.157
## DiffuseFlows PowerConsumption_Zone2
## Temperature 0.197 0.382
## Humidity -0.257 -0.295
## WindSpeed -0.001 0.146
## GeneralDiffuseFlows 0.565 0.157
## DiffuseFlows 1.000 0.045
## PowerConsumption_Zone2 0.045 1.000
pairs(xvars, main = "Scatterplotová matica – vysvetľujúce premenné")
VIF pre premennú \(x_j\):
\[ VIF_j = \frac{1}{1 - R_j^2} \]
Pravidlá (orientačne): - VIF > 5: problém môže byť relevantný, - VIF > 10: silná multikolinearita.
library(car)
vif(model)
## Temperature Humidity WindSpeed
## 1.965281 1.464191 1.324643
## GeneralDiffuseFlows DiffuseFlows PowerConsumption_Zone2
## 1.952539 1.486298 1.205864
Condition number je indikátor, ktorý vie zachytiť aj “cyklickú” multikolinearitu medzi viacerými premennými naraz. Pravidlo (orientačne): - < 10 nízka, - 10–30 mierna, - 30–100 silná, - > 100 veľmi vážna.
Dôležité: Condition number je citlivý na mierku premenných (jednotky), preto pri porovnávaní často pomáha škálovanie.
X <- model.matrix(model)[, -1] # bez interceptu
XtX <- t(X) %*% X
eig <- eigen(XtX)
condition_number <- sqrt(max(eig$values) / min(eig$values))
condition_number
## [1] 10756.87
Najčastejší “čistý” zásah je vynechať regresor, ktorý spôsobuje
multikolinearitu. Tu skúsime vynechať
PowerConsumption_Zone2 a porovnať výsledok.
model_noZone2 <- lm(
PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
GeneralDiffuseFlows + DiffuseFlows,
data = udaje
)
summary(model_noZone2)
##
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity +
## WindSpeed + GeneralDiffuseFlows + DiffuseFlows, data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15613.5 -4933.0 -653.3 4113.8 17982.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.675e+04 2.138e+02 125.132 <2e-16 ***
## Temperature 5.349e+02 6.418e+00 83.331 <2e-16 ***
## Humidity -5.651e+01 2.130e+00 -26.525 <2e-16 ***
## WindSpeed -1.487e+02 1.358e+01 -10.951 <2e-16 ***
## GeneralDiffuseFlows -1.702e+00 1.464e-01 -11.628 <2e-16 ***
## DiffuseFlows -8.731e-02 2.721e-01 -0.321 0.748
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6349 on 52410 degrees of freedom
## Multiple R-squared: 0.2073, Adjusted R-squared: 0.2072
## F-statistic: 2741 on 5 and 52410 DF, p-value: < 2.2e-16
vif(model_noZone2)
## Temperature Humidity WindSpeed GeneralDiffuseFlows
## 1.811650 1.427287 1.322804 1.947516
## DiffuseFlows
## 1.484903
Škálovanie nemení “informatívnosť” dát, ale výrazne pomôže pri numerickej stabilite a pri Condition number. Koeficienty sa potom interpretujú v štandardizovaných jednotkách (menej intuitívne v pôvodných jednotkách).
udaje_scaled <- udaje
cols_to_scale <- c("Temperature", "Humidity", "WindSpeed",
"GeneralDiffuseFlows", "DiffuseFlows",
"PowerConsumption_Zone2")
udaje_scaled[cols_to_scale] <- scale(udaje_scaled[cols_to_scale], center = TRUE, scale = TRUE)
model_scaled <- lm(
PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2,
data = udaje_scaled
)
summary(model_scaled)
##
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity +
## WindSpeed + GeneralDiffuseFlows + DiffuseFlows + PowerConsumption_Zone2,
## data = udaje_scaled)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17249.8 -2493.9 71.9 2444.0 15359.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 32344.97 16.65 1943.198 < 2e-16 ***
## Temperature 1120.13 23.33 48.002 < 2e-16 ***
## Humidity 96.70 20.14 4.801 1.58e-06 ***
## WindSpeed -131.52 19.16 -6.865 6.72e-12 ***
## GeneralDiffuseFlows -90.12 23.26 -3.875 0.000107 ***
## DiffuseFlows 178.80 20.29 8.811 < 2e-16 ***
## PowerConsumption_Zone2 5576.18 18.28 305.066 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3811 on 52409 degrees of freedom
## Multiple R-squared: 0.7144, Adjusted R-squared: 0.7144
## F-statistic: 2.185e+04 on 6 and 52409 DF, p-value: < 2.2e-16
vif(model_scaled)
## Temperature Humidity WindSpeed
## 1.965281 1.464191 1.324643
## GeneralDiffuseFlows DiffuseFlows PowerConsumption_Zone2
## 1.952539 1.486298 1.205864
X <- model.matrix(model_scaled)[, -1]
XtX <- t(X) %*% X
eig <- eigen(XtX)
condition_number_scaled <- sqrt(max(eig$values) / min(eig$values))
condition_number_scaled
## [1] 2.755976
Ak je problém hlavne v mierke (jednotkách), dá sa prepočítať len konkrétna premenná. Tu napríklad prepočítame spotrebu v Zone2 na “tisíce jednotiek” (len príklad – reálne jednotky závisia od datasetu).
udaje_unit <- udaje
udaje_unit$Zone2_1000 <- udaje_unit$PowerConsumption_Zone2 / 1000
model_unit <- lm(
PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
GeneralDiffuseFlows + DiffuseFlows + Zone2_1000,
data = udaje_unit
)
summary(model_unit)
##
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity +
## WindSpeed + GeneralDiffuseFlows + DiffuseFlows + Zone2_1000,
## data = udaje_unit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17249.8 -2493.9 71.9 2444.0 15359.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5803.08166 145.53221 39.875 < 2e-16 ***
## Temperature 192.61151 4.01255 48.002 < 2e-16 ***
## Humidity 6.21787 1.29518 4.801 1.58e-06 ***
## WindSpeed -55.99210 8.15616 -6.865 6.72e-12 ***
## GeneralDiffuseFlows -0.34084 0.08797 -3.875 0.000107 ***
## DiffuseFlows 1.43949 0.16338 8.811 < 2e-16 ***
## Zone2_1000 1072.04024 3.51412 305.066 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3811 on 52409 degrees of freedom
## Multiple R-squared: 0.7144, Adjusted R-squared: 0.7144
## F-statistic: 2.185e+04 on 6 and 52409 DF, p-value: < 2.2e-16
vif(model_unit)
## Temperature Humidity WindSpeed GeneralDiffuseFlows
## 1.965281 1.464191 1.324643 1.952539
## DiffuseFlows Zone2_1000
## 1.486298 1.205864
X <- model.matrix(model_unit)[, -1]
XtX <- t(X) %*% X
eig <- eigen(XtX)
condition_number_unit <- sqrt(max(eig$values) / min(eig$values))
condition_number_unit
## [1] 169.4135
PCA vytvorí komponent (kombináciu) z korelovaných premenných a tým
vie znížiť multikolinearitu. Nižšie spravíme PCA z dvojice
GeneralDiffuseFlows a DiffuseFlows a použijeme
prvú hlavnú komponentu.
X_pca <- scale(udaje[, c("GeneralDiffuseFlows", "DiffuseFlows")], center = TRUE, scale = TRUE)
pca_res <- prcomp(X_pca)
summary(pca_res)
## Importance of components:
## PC1 PC2
## Standard deviation 1.2509 0.6598
## Proportion of Variance 0.7824 0.2176
## Cumulative Proportion 0.7824 1.0000
udaje$PC1_flows <- pca_res$x[, 1]
model_pca <- lm(
PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed + PC1_flows + PowerConsumption_Zone2,
data = udaje
)
summary(model_pca)
##
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity +
## WindSpeed + PC1_flows + PowerConsumption_Zone2, data = udaje)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17417.7 -2482.1 65.5 2439.3 15346.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 5.853e+03 1.436e+02 40.765 < 2e-16 ***
## Temperature 1.869e+02 3.932e+00 47.527 < 2e-16 ***
## Humidity 7.646e+00 1.280e+00 5.973 2.34e-09 ***
## WindSpeed -5.643e+01 8.160e+00 -6.916 4.70e-12 ***
## PC1_flows 7.979e+01 1.511e+01 5.282 1.28e-07 ***
## PowerConsumption_Zone2 1.072e+00 3.515e-03 305.060 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3813 on 52410 degrees of freedom
## Multiple R-squared: 0.7141, Adjusted R-squared: 0.7141
## F-statistic: 2.619e+04 on 5 and 52410 DF, p-value: < 2.2e-16
vif(model_pca)
## Temperature Humidity WindSpeed
## 1.885141 1.428806 1.324566
## PC1_flows PowerConsumption_Zone2
## 1.287453 1.205603
Ridge regresia stabilizuje odhady pomocou penalizácie. Koeficienty sú potom skreslené (biased), ale často s nižšou variabilitou.
library(MASS)
ridge_df <- udaje[, c("PowerConsumption_Zone1",
"Temperature", "Humidity", "WindSpeed",
"GeneralDiffuseFlows", "DiffuseFlows",
"PowerConsumption_Zone2")]
# Ridge sa správa lepšie na škálovaných premenných
ridge_df[, -1] <- scale(ridge_df[, -1], center = TRUE, scale = TRUE)
ridge_fit <- lm.ridge(
PowerConsumption_Zone1 ~ .,
data = ridge_df,
lambda = seq(0, 50, 5)
)
ridge_fit
## Temperature Humidity WindSpeed GeneralDiffuseFlows DiffuseFlows
## 0 32344.97 1120.128 96.69522 -131.5177 -90.11784 178.8008
## 5 32344.97 1120.127 96.52650 -131.4471 -90.06920 178.7398
## 10 32344.97 1120.127 96.35786 -131.3765 -90.02056 178.6787
## 15 32344.97 1120.126 96.18928 -131.3060 -89.97193 178.6177
## 20 32344.97 1120.126 96.02077 -131.2354 -89.92331 178.5568
## 25 32344.97 1120.126 95.85233 -131.1649 -89.87469 178.4958
## 30 32344.97 1120.125 95.68396 -131.0944 -89.82608 178.4349
## 35 32344.97 1120.125 95.51565 -131.0239 -89.77747 178.3740
## 40 32344.97 1120.124 95.34742 -130.9534 -89.72887 178.3132
## 45 32344.97 1120.124 95.17925 -130.8829 -89.68028 178.2523
## 50 32344.97 1120.123 95.01115 -130.8124 -89.63169 178.1915
## PowerConsumption_Zone2
## 0 5576.181
## 5 5575.584
## 10 5574.987
## 15 5574.391
## 20 5573.795
## 25 5573.198
## 30 5572.602
## 35 5572.007
## 40 5571.411
## 45 5570.815
## 50 5570.220
plot(ridge_fit)