Pri práci budeme používať tieto knižnice:
library(zoo)
library(tseries)
library(lmtest)
library(sandwich)
library(car)
library(ggplot2)
library(patchwork)
rm(list = ls())
Cieľom je vysvetliť spotrebu elektriny v zóne 1 (PowerConsumption_Zone1) pomocou meteorologických premenných a pravidelných časových efektov.
Budeme uvažovať tieto vysvetľujúce premenné:
Temperature (teplota)Humidity (vlhkosť)WindSpeed (rýchlosť vetra)GeneralDiffuseFlows, DiffuseFlows (difúzne
slnečné žiarenie)Pracovne očakávame, že časové premenné budú výrazne významné (zachytia denný/týždenný režim) a počasie prinesie dodatočnú vysvetľujúcu informáciu.
Dáta sú v súbore CSV. Predpokladáme, že je uložený v priečinku
udaje pod názvom powerconsumption.csv.
udaje <- read.csv("powerconsumption.csv", sep = ",", header = TRUE)
# Prevod časovej premennej
udaje$Datetime <- as.POSIXct(udaje$Datetime, format = "%m/%d/%Y %H:%M")
# Rýchla kontrola štruktúry dát
str(udaje)
'data.frame': 52416 obs. of 9 variables:
$ Datetime : POSIXct, format: "2017-01-01 00:00:00" "2017-01-01 00:10:00" ...
$ Temperature : num 6.56 6.41 6.31 6.12 5.92 ...
$ Humidity : num 73.8 74.5 74.5 75 75.7 76.9 77.7 78.2 78.1 77.3 ...
$ WindSpeed : num 0.083 0.083 0.08 0.083 0.081 0.081 0.08 0.085 0.081 0.082 ...
$ GeneralDiffuseFlows : num 0.051 0.07 0.062 0.091 0.048 0.059 0.048 0.055 0.066 0.062 ...
$ DiffuseFlows : num 0.119 0.085 0.1 0.096 0.085 0.108 0.096 0.093 0.141 0.111 ...
$ PowerConsumption_Zone1: num 34056 29815 29128 28229 27336 ...
$ PowerConsumption_Zone2: num 16129 19375 19007 18361 17872 ...
$ PowerConsumption_Zone3: num 20241 20131 19668 18899 18442 ...
summary(udaje)
Datetime Temperature Humidity
Min. :2017-01-01 00:00:00 Min. : 3.247 Min. :11.34
1st Qu.:2017-04-01 23:57:30 1st Qu.:14.410 1st Qu.:58.31
Median :2017-07-01 23:55:00 Median :18.780 Median :69.86
Mean :2017-07-01 23:55:00 Mean :18.810 Mean :68.26
3rd Qu.:2017-09-30 23:52:30 3rd Qu.:22.890 3rd Qu.:81.40
Max. :2017-12-30 23:50:00 Max. :40.010 Max. :94.80
WindSpeed GeneralDiffuseFlows DiffuseFlows
Min. :0.050 Min. : 0.004 Min. : 0.011
1st Qu.:0.078 1st Qu.: 0.062 1st Qu.: 0.122
Median :0.086 Median : 5.035 Median : 4.456
Mean :1.959 Mean : 182.697 Mean : 75.028
3rd Qu.:4.915 3rd Qu.: 319.600 3rd Qu.:101.000
Max. :6.483 Max. :1163.000 Max. :936.000
PowerConsumption_Zone1 PowerConsumption_Zone2 PowerConsumption_Zone3
Min. :13896 Min. : 8560 Min. : 5935
1st Qu.:26311 1st Qu.:16981 1st Qu.:13129
Median :32266 Median :20823 Median :16415
Mean :32345 Mean :21043 Mean :17835
3rd Qu.:37309 3rd Qu.:24714 3rd Qu.:21624
Max. :52204 Max. :37409 Max. :47598
Keďže ide o časový rad, pridáme: - hodinu dňa (Hour) -
deň v týždni (DayOfWeek)
Aby boli výstupy prehľadnejšie, v tomto cvičení budeme pracovať s jedným mesiacom (jún 2017). Ak chceš pracovať s celým rokom, filter jednoducho zakomentuj.
udaje_model <- udaje[, c("Datetime",
"PowerConsumption_Zone1",
"Temperature", "Humidity", "WindSpeed",
"GeneralDiffuseFlows", "DiffuseFlows")]
udaje_model$Hour <- as.integer(format(udaje_model$Datetime, "%H"))
udaje_model$DayOfWeek <- factor(format(udaje_model$Datetime, "%a"),
levels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))
# Filter na jún 2017 (voliteľné)
udaje_model <- subset(
udaje_model,
Datetime >= as.POSIXct("2017-06-01 00:00:00") &
Datetime <= as.POSIXct("2017-06-30 23:50:00")
)
# Kontrola chýbajúcich hodnôt
colSums(is.na(udaje_model))
Datetime PowerConsumption_Zone1 Temperature
0 0 0
Humidity WindSpeed GeneralDiffuseFlows
0 0 0
DiffuseFlows Hour DayOfWeek
0 0 0
V tejto databáze bývajú chýbajúce hodnoty zriedkavé. Napriek tomu je dobré mať korektný postup. Ak sa niekde vyskytne NA, doplníme ho mediánom danej numerickej premennej.
if (anyNA(udaje_model)) {
num_cols <- sapply(udaje_model, is.numeric)
med <- sapply(udaje_model[, num_cols, drop = FALSE], median, na.rm = TRUE)
for (cn in names(med)) {
udaje_model[[cn]][is.na(udaje_model[[cn]])] <- med[cn]
}
}
Boxploty nám pomôžu rýchlo identifikovať extrémne hodnoty alebo neštandardné rozdelenia.
vars <- c("PowerConsumption_Zone1", "Temperature", "Humidity", "WindSpeed",
"GeneralDiffuseFlows", "DiffuseFlows")
par(mfrow = c(2, 3))
par(mar = c(4, 4, 2, 1))
for (v in vars) {
boxplot(udaje_model[[v]],
main = v,
ylab = "Hodnota",
col = "lightblue")
}
par(mfrow = c(1, 1))
Odhadneme lineárny model pomocou lm().
V prvom modeli použijeme všetky meteorologické premenné a pridáme časové efekty (hodina a deň v týždni). Časové efekty modelujeme ako faktorové premenné, aby model vedel zachytiť rozdielne priemery v jednotlivých hodinách/dňoch.
model1 <- lm(
PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
GeneralDiffuseFlows + DiffuseFlows +
factor(Hour) + DayOfWeek,
data = udaje_model
)
summary(model1)
Call:
lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity +
WindSpeed + GeneralDiffuseFlows + DiffuseFlows + factor(Hour) +
DayOfWeek, data = udaje_model)
Residuals:
Min 1Q Median 3Q Max
-13732.4 -1228.7 19.2 1470.4 7187.2
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.770e+04 6.895e+02 69.176 < 2e-16 ***
Temperature -1.875e+02 2.804e+01 -6.688 2.56e-11 ***
Humidity -5.302e+01 3.352e+00 -15.818 < 2e-16 ***
WindSpeed -2.844e+02 1.919e+01 -14.819 < 2e-16 ***
GeneralDiffuseFlows 1.980e+00 3.552e-01 5.574 2.64e-08 ***
DiffuseFlows -3.643e+00 4.106e-01 -8.872 < 2e-16 ***
factor(Hour)1 -1.324e+03 2.567e+02 -5.157 2.62e-07 ***
factor(Hour)2 -8.817e+02 2.570e+02 -3.431 0.000607 ***
factor(Hour)3 -6.175e+03 2.577e+02 -23.965 < 2e-16 ***
factor(Hour)4 -1.199e+04 2.583e+02 -46.409 < 2e-16 ***
factor(Hour)5 -1.655e+04 2.589e+02 -63.910 < 2e-16 ***
factor(Hour)6 -1.698e+04 2.591e+02 -65.529 < 2e-16 ***
factor(Hour)7 -1.590e+04 2.661e+02 -59.733 < 2e-16 ***
factor(Hour)8 -1.381e+04 3.006e+02 -45.941 < 2e-16 ***
factor(Hour)9 -1.144e+04 3.474e+02 -32.920 < 2e-16 ***
factor(Hour)10 -9.447e+03 3.696e+02 -25.562 < 2e-16 ***
factor(Hour)11 -7.655e+03 3.712e+02 -20.620 < 2e-16 ***
factor(Hour)12 -6.561e+03 3.795e+02 -17.288 < 2e-16 ***
factor(Hour)13 -5.888e+03 3.831e+02 -15.367 < 2e-16 ***
factor(Hour)14 -5.350e+03 3.668e+02 -14.587 < 2e-16 ***
factor(Hour)15 -5.289e+03 3.488e+02 -15.165 < 2e-16 ***
factor(Hour)16 -5.390e+03 3.294e+02 -16.364 < 2e-16 ***
factor(Hour)17 -4.018e+03 3.166e+02 -12.690 < 2e-16 ***
factor(Hour)18 -1.297e+03 3.077e+02 -4.213 2.57e-05 ***
factor(Hour)19 3.778e+03 2.749e+02 13.741 < 2e-16 ***
factor(Hour)20 5.556e+03 2.634e+02 21.096 < 2e-16 ***
factor(Hour)21 5.157e+03 2.595e+02 19.876 < 2e-16 ***
factor(Hour)22 4.926e+03 2.576e+02 19.124 < 2e-16 ***
factor(Hour)23 2.986e+03 2.568e+02 11.629 < 2e-16 ***
DayOfWeekTue -4.510e+02 1.446e+02 -3.119 0.001829 **
DayOfWeekWed 1.452e+02 1.480e+02 0.981 0.326639
DayOfWeekThu 1.139e+02 1.419e+02 0.803 0.421980
DayOfWeekFri 3.682e+02 1.414e+02 2.603 0.009275 **
DayOfWeekSat 1.329e+03 1.445e+02 9.197 < 2e-16 ***
DayOfWeekSun -7.597e+02 1.440e+02 -5.276 1.38e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2435 on 4285 degrees of freedom
Multiple R-squared: 0.8902, Adjusted R-squared: 0.8893
F-statistic: 1021 on 34 and 4285 DF, p-value: < 2.2e-16
Štandardné diagnostické grafy nám pomáhajú posúdiť: - nelinearity, - heteroskedasticitu, - odľahlé a vplyvné pozorovania, - približnú normalitu rezíduí.
par(mfrow = c(2, 2))
plot(model1)
par(mfrow = c(1, 1))
Pri veľkých vzorkách je bežné, že test normality vyjde ako „zamietnutý“ aj pri menších odchýlkach. Preto výsledok testu vždy interpretujeme spolu s Q-Q grafom.
res1 <- residuals(model1)
jarque.bera.test(res1)
Jarque Bera Test
data: res1
X-squared = 1187.6, df = 2, p-value < 2.2e-16
outlierTest(model1)
Premenné žiarenia (GeneralDiffuseFlows,
DiffuseFlows) bývajú výrazne asymetrické (mnoho nízkych
hodnôt, občas vysoké). Použijeme transformáciu log(1 + x)
(v R: log1p(x)), aby sme sa vyhli problému
log(0).
model2 <- lm(
PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
I(log1p(GeneralDiffuseFlows)) + I(log1p(DiffuseFlows)) +
factor(Hour) + DayOfWeek,
data = udaje_model
)
summary(model2)
Call:
lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity +
WindSpeed + I(log1p(GeneralDiffuseFlows)) + I(log1p(DiffuseFlows)) +
factor(Hour) + DayOfWeek, data = udaje_model)
Residuals:
Min 1Q Median 3Q Max
-13673.6 -1229.7 23.5 1471.0 7188.5
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47442.32 679.80 69.789 < 2e-16 ***
Temperature -175.44 27.57 -6.363 2.18e-10 ***
Humidity -52.36 3.36 -15.585 < 2e-16 ***
WindSpeed -282.95 19.20 -14.737 < 2e-16 ***
I(log1p(GeneralDiffuseFlows)) 488.64 96.11 5.084 3.85e-07 ***
I(log1p(DiffuseFlows)) -771.21 72.87 -10.583 < 2e-16 ***
factor(Hour)1 -1318.24 256.72 -5.135 2.95e-07 ***
factor(Hour)2 -872.53 256.95 -3.396 0.000691 ***
factor(Hour)3 -6164.95 257.63 -23.929 < 2e-16 ***
factor(Hour)4 -11968.97 258.20 -46.354 < 2e-16 ***
factor(Hour)5 -16515.61 259.14 -63.732 < 2e-16 ***
factor(Hour)6 -16439.43 337.58 -48.699 < 2e-16 ***
factor(Hour)7 -14934.54 453.37 -32.942 < 2e-16 ***
factor(Hour)8 -12797.71 520.90 -24.568 < 2e-16 ***
factor(Hour)9 -10443.56 556.63 -18.762 < 2e-16 ***
factor(Hour)10 -8394.69 570.70 -14.710 < 2e-16 ***
factor(Hour)11 -6356.62 575.51 -11.045 < 2e-16 ***
factor(Hour)12 -5197.98 577.35 -9.003 < 2e-16 ***
factor(Hour)13 -4516.20 580.21 -7.784 8.77e-15 ***
factor(Hour)14 -4052.55 571.55 -7.090 1.56e-12 ***
factor(Hour)15 -4106.16 562.12 -7.305 3.29e-13 ***
factor(Hour)16 -4350.30 550.03 -7.909 3.27e-15 ***
factor(Hour)17 -3012.67 534.24 -5.639 1.82e-08 ***
factor(Hour)18 -293.25 516.47 -0.568 0.570197
factor(Hour)19 4786.46 451.68 10.597 < 2e-16 ***
factor(Hour)20 5986.12 307.41 19.473 < 2e-16 ***
factor(Hour)21 5125.05 259.31 19.764 < 2e-16 ***
factor(Hour)22 4904.81 257.53 19.045 < 2e-16 ***
factor(Hour)23 2979.92 256.78 11.605 < 2e-16 ***
DayOfWeekTue -430.79 144.72 -2.977 0.002930 **
DayOfWeekWed 196.60 148.22 1.326 0.184779
DayOfWeekThu 125.31 141.51 0.886 0.375922
DayOfWeekFri 396.80 141.02 2.814 0.004920 **
DayOfWeekSat 1331.95 144.37 9.226 < 2e-16 ***
DayOfWeekSun -771.53 143.96 -5.359 8.79e-08 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 2435 on 4285 degrees of freedom
Multiple R-squared: 0.8902, Adjusted R-squared: 0.8893
F-statistic: 1021 on 34 and 4285 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)
par(mfrow = c(1, 1))
res2 <- residuals(model2)
jarque.bera.test(res2)
Jarque Bera Test
data: res2
X-squared = 1211.7, df = 2, p-value < 2.2e-16
outlierTest(model2)
Heteroskedasticita (nekonštantný rozptyl rezíduí) spôsobuje, že
klasické štandardné chyby a t-testy môžu byť nespoľahlivé. Budeme ju
skúmať: 1. vizuálne – pomocou grafov štvorcov rezíduí, 2. formálne –
Breusch–Pagan testom (bptest).
Ako potenciálne problematické premenné v tejto databáze zmysluplne preveríme napríklad teplotu a vlhkosť, prípadne aj vyrovnané hodnoty (fitted).
tmp <- udaje_model
tmp$sr2 <- resid(model2)^2
tmp$fitted <- fitted(model2)
p1 <- ggplot(tmp, aes(x = Temperature, y = sr2)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Temperature", y = "Squared residuals", title = "Squared residuals vs Temperature")
p2 <- ggplot(tmp, aes(x = Humidity, y = sr2)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Humidity", y = "Squared residuals", title = "Squared residuals vs Humidity")
p1 + p2
ggplot(tmp, aes(x = fitted, y = sr2)) +
geom_point(alpha = 0.5) +
geom_smooth(method = "loess", se = FALSE) +
labs(x = "Fitted values", y = "Squared residuals", title = "Squared residuals vs fitted values")
bptest(model1)
studentized Breusch-Pagan test
data: model1
BP = 1066.7, df = 34, p-value < 2.2e-16
bptest(model2)
studentized Breusch-Pagan test
data: model2
BP = 1064.2, df = 34, p-value < 2.2e-16
Aj keď heteroskedasticita nie je výrazná, v praxi sa často reportujú robustné štandardné chyby (White/HC). Nižšie uvádzame odhady pre model2 s HC1 korekciou.
coeftest(model2, vcov. = vcovHC(model2, type = "HC1"))
t test of coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 47442.3217 598.7108 79.2408 < 2.2e-16
Temperature -175.4384 24.1693 -7.2587 4.617e-13
Humidity -52.3578 2.9974 -17.4679 < 2.2e-16
WindSpeed -282.9507 22.4675 -12.5938 < 2.2e-16
I(log1p(GeneralDiffuseFlows)) 488.6411 82.5838 5.9169 3.536e-09
I(log1p(DiffuseFlows)) -771.2077 66.4586 -11.6043 < 2.2e-16
factor(Hour)1 -1318.2374 326.1968 -4.0412 5.410e-05
factor(Hour)2 -872.5329 363.4400 -2.4008 0.016403
factor(Hour)3 -6164.9508 330.7945 -18.6368 < 2.2e-16
factor(Hour)4 -11968.9716 273.4080 -43.7770 < 2.2e-16
factor(Hour)5 -16515.6105 239.4738 -68.9663 < 2.2e-16
factor(Hour)6 -16439.4311 285.7244 -57.5360 < 2.2e-16
factor(Hour)7 -14934.5411 372.4581 -40.0972 < 2.2e-16
factor(Hour)8 -12797.7136 429.3244 -29.8090 < 2.2e-16
factor(Hour)9 -10443.5591 453.2561 -23.0412 < 2.2e-16
factor(Hour)10 -8394.6861 464.2972 -18.0804 < 2.2e-16
factor(Hour)11 -6356.6240 467.1706 -13.6066 < 2.2e-16
factor(Hour)12 -5197.9775 470.7731 -11.0414 < 2.2e-16
factor(Hour)13 -4516.2052 484.7641 -9.3163 < 2.2e-16
factor(Hour)14 -4052.5488 479.3877 -8.4536 < 2.2e-16
factor(Hour)15 -4106.1628 470.6623 -8.7242 < 2.2e-16
factor(Hour)16 -4350.3018 458.6032 -9.4860 < 2.2e-16
factor(Hour)17 -3012.6718 448.9005 -6.7112 2.182e-11
factor(Hour)18 -293.2530 457.3322 -0.6412 0.521411
factor(Hour)19 4786.4585 411.4654 11.6327 < 2.2e-16
factor(Hour)20 5986.1194 284.1544 21.0664 < 2.2e-16
factor(Hour)21 5125.0494 263.6862 19.4362 < 2.2e-16
factor(Hour)22 4904.8127 262.5919 18.6785 < 2.2e-16
factor(Hour)23 2979.9242 285.9998 10.4193 < 2.2e-16
DayOfWeekTue -430.7847 170.4356 -2.5276 0.011522
DayOfWeekWed 196.6000 158.8200 1.2379 0.215829
DayOfWeekThu 125.3096 147.1084 0.8518 0.394363
DayOfWeekFri 396.7987 150.0967 2.6436 0.008232
DayOfWeekSat 1331.9508 141.7264 9.3980 < 2.2e-16
DayOfWeekSun -771.5285 166.1685 -4.6430 3.536e-06
(Intercept) ***
Temperature ***
Humidity ***
WindSpeed ***
I(log1p(GeneralDiffuseFlows)) ***
I(log1p(DiffuseFlows)) ***
factor(Hour)1 ***
factor(Hour)2 *
factor(Hour)3 ***
factor(Hour)4 ***
factor(Hour)5 ***
factor(Hour)6 ***
factor(Hour)7 ***
factor(Hour)8 ***
factor(Hour)9 ***
factor(Hour)10 ***
factor(Hour)11 ***
factor(Hour)12 ***
factor(Hour)13 ***
factor(Hour)14 ***
factor(Hour)15 ***
factor(Hour)16 ***
factor(Hour)17 ***
factor(Hour)18
factor(Hour)19 ***
factor(Hour)20 ***
factor(Hour)21 ***
factor(Hour)22 ***
factor(Hour)23 ***
DayOfWeekTue *
DayOfWeekWed
DayOfWeekThu
DayOfWeekFri **
DayOfWeekSat ***
DayOfWeekSun ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Model spotreby elektriny dáva zmysel, ak popri meteorologických
premenných zohľadníme aj časové efekty (hodina dňa, deň v týždni).
Transformácia log(1 + x) pri premenných žiarenia pomáha
stabilizovať rozdelenie a znižovať vplyv extrémnych hodnôt. Pri
interpretácii normality rezíduí treba brať do úvahy veľkosť vzorky a
opierať sa aj o diagnostické grafy. Ak by sa objavila
heteroskedasticita, vhodným riešením je reportovať robustné štandardné
chyby.