Použijeme databázu spotreby elektriny (2017, 10-minútové intervaly) s meteorologickými premennými. Cieľ: vysvetliť spotrebu elektriny v zóne 1 pomocou počasia a časových efektov.

library(zoo)
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(tseries)
## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo
library(lmtest)
library(sandwich)
library(car)
## Loading required package: carData
rm(list = ls())

Úvod do problému a hypotézy

Budeme modelovať PowerConsumption_Zone1 (spotreba v zóne 1) v závislosti od: - teploty, vlhkosti, rýchlosti vetra, - difúzneho slnečného žiarenia, - časových efektov (hodina dňa, deň v týždni).

Predpokladáme, že časové premenné zachytia pravidelný denný a týždenný cyklus spotreby elektriny a meteorologické premenné budú mať dodatočný vysvetľujúci vplyv.

Príprava databázy

udaje <- read.csv("powerconsumption.csv", sep = ",", header = TRUE)
udaje$Datetime <- as.POSIXct(udaje$Datetime, format = "%m/%d/%Y %H:%M")

str(udaje)
## 'data.frame':    52416 obs. of  9 variables:
##  $ Datetime              : POSIXct, format: "2017-01-01 00:00:00" "2017-01-01 00:10:00" ...
##  $ Temperature           : num  6.56 6.41 6.31 6.12 5.92 ...
##  $ Humidity              : num  73.8 74.5 74.5 75 75.7 76.9 77.7 78.2 78.1 77.3 ...
##  $ WindSpeed             : num  0.083 0.083 0.08 0.083 0.081 0.081 0.08 0.085 0.081 0.082 ...
##  $ GeneralDiffuseFlows   : num  0.051 0.07 0.062 0.091 0.048 0.059 0.048 0.055 0.066 0.062 ...
##  $ DiffuseFlows          : num  0.119 0.085 0.1 0.096 0.085 0.108 0.096 0.093 0.141 0.111 ...
##  $ PowerConsumption_Zone1: num  34056 29815 29128 28229 27336 ...
##  $ PowerConsumption_Zone2: num  16129 19375 19007 18361 17872 ...
##  $ PowerConsumption_Zone3: num  20241 20131 19668 18899 18442 ...
summary(udaje)
##     Datetime                    Temperature        Humidity       WindSpeed    
##  Min.   :2017-01-01 00:00:00   Min.   : 3.247   Min.   :11.34   Min.   :0.050  
##  1st Qu.:2017-04-01 23:57:30   1st Qu.:14.410   1st Qu.:58.31   1st Qu.:0.078  
##  Median :2017-07-01 23:55:00   Median :18.780   Median :69.86   Median :0.086  
##  Mean   :2017-07-01 23:55:00   Mean   :18.810   Mean   :68.26   Mean   :1.959  
##  3rd Qu.:2017-09-30 23:52:30   3rd Qu.:22.890   3rd Qu.:81.40   3rd Qu.:4.915  
##  Max.   :2017-12-30 23:50:00   Max.   :40.010   Max.   :94.80   Max.   :6.483  
##  GeneralDiffuseFlows  DiffuseFlows     PowerConsumption_Zone1
##  Min.   :   0.004    Min.   :  0.011   Min.   :13896         
##  1st Qu.:   0.062    1st Qu.:  0.122   1st Qu.:26311         
##  Median :   5.035    Median :  4.456   Median :32266         
##  Mean   : 182.697    Mean   : 75.028   Mean   :32345         
##  3rd Qu.: 319.600    3rd Qu.:101.000   3rd Qu.:37309         
##  Max.   :1163.000    Max.   :936.000   Max.   :52204         
##  PowerConsumption_Zone2 PowerConsumption_Zone3
##  Min.   : 8560          Min.   : 5935         
##  1st Qu.:16981          1st Qu.:13129         
##  Median :20823          Median :16415         
##  Mean   :21043          Mean   :17835         
##  3rd Qu.:24714          3rd Qu.:21624         
##  Max.   :37409          Max.   :47598

Výber premenných a časové efekty

udaje_model <- udaje[, c("Datetime",
                         "PowerConsumption_Zone1",
                         "Temperature", "Humidity", "WindSpeed",
                         "GeneralDiffuseFlows", "DiffuseFlows")]

udaje_model$Hour <- as.integer(format(udaje_model$Datetime, "%H"))
udaje_model$DayOfWeek <- factor(format(udaje_model$Datetime, "%a"),
                                levels = c("Mon","Tue","Wed","Thu","Fri","Sat","Sun"))

udaje_model <- subset(udaje_model,
                      Datetime >= as.POSIXct("2017-06-01 00:00:00") &
                      Datetime <= as.POSIXct("2017-06-30 23:50:00"))

colSums(is.na(udaje_model))
##               Datetime PowerConsumption_Zone1            Temperature 
##                      0                      0                      0 
##               Humidity              WindSpeed    GeneralDiffuseFlows 
##                      0                      0                      0 
##           DiffuseFlows                   Hour              DayOfWeek 
##                      0                      0                      0

Boxploty

vars <- c("PowerConsumption_Zone1", "Temperature", "Humidity", "WindSpeed",
          "GeneralDiffuseFlows", "DiffuseFlows")

par(mfrow = c(2, 3))
for (v in vars) {
  boxplot(udaje_model[[v]], main = v, col = "lightblue")
}

par(mfrow = c(1, 1))

Lineárna regresia

model1 <- lm(PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
               GeneralDiffuseFlows + DiffuseFlows +
               factor(Hour) + DayOfWeek,
             data = udaje_model)

summary(model1)
## 
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity + 
##     WindSpeed + GeneralDiffuseFlows + DiffuseFlows + factor(Hour) + 
##     DayOfWeek, data = udaje_model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13732.4  -1228.7     19.2   1470.4   7187.2 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.770e+04  6.895e+02  69.176  < 2e-16 ***
## Temperature         -1.875e+02  2.804e+01  -6.688 2.56e-11 ***
## Humidity            -5.302e+01  3.352e+00 -15.818  < 2e-16 ***
## WindSpeed           -2.844e+02  1.919e+01 -14.819  < 2e-16 ***
## GeneralDiffuseFlows  1.980e+00  3.552e-01   5.574 2.64e-08 ***
## DiffuseFlows        -3.643e+00  4.106e-01  -8.872  < 2e-16 ***
## factor(Hour)1       -1.324e+03  2.567e+02  -5.157 2.62e-07 ***
## factor(Hour)2       -8.817e+02  2.570e+02  -3.431 0.000607 ***
## factor(Hour)3       -6.175e+03  2.577e+02 -23.965  < 2e-16 ***
## factor(Hour)4       -1.199e+04  2.583e+02 -46.409  < 2e-16 ***
## factor(Hour)5       -1.655e+04  2.589e+02 -63.910  < 2e-16 ***
## factor(Hour)6       -1.698e+04  2.591e+02 -65.529  < 2e-16 ***
## factor(Hour)7       -1.590e+04  2.661e+02 -59.733  < 2e-16 ***
## factor(Hour)8       -1.381e+04  3.006e+02 -45.941  < 2e-16 ***
## factor(Hour)9       -1.144e+04  3.474e+02 -32.920  < 2e-16 ***
## factor(Hour)10      -9.447e+03  3.696e+02 -25.562  < 2e-16 ***
## factor(Hour)11      -7.655e+03  3.712e+02 -20.620  < 2e-16 ***
## factor(Hour)12      -6.561e+03  3.795e+02 -17.288  < 2e-16 ***
## factor(Hour)13      -5.888e+03  3.831e+02 -15.367  < 2e-16 ***
## factor(Hour)14      -5.350e+03  3.668e+02 -14.587  < 2e-16 ***
## factor(Hour)15      -5.289e+03  3.488e+02 -15.165  < 2e-16 ***
## factor(Hour)16      -5.390e+03  3.294e+02 -16.364  < 2e-16 ***
## factor(Hour)17      -4.018e+03  3.166e+02 -12.690  < 2e-16 ***
## factor(Hour)18      -1.297e+03  3.077e+02  -4.213 2.57e-05 ***
## factor(Hour)19       3.778e+03  2.749e+02  13.741  < 2e-16 ***
## factor(Hour)20       5.556e+03  2.634e+02  21.096  < 2e-16 ***
## factor(Hour)21       5.157e+03  2.595e+02  19.876  < 2e-16 ***
## factor(Hour)22       4.926e+03  2.576e+02  19.124  < 2e-16 ***
## factor(Hour)23       2.986e+03  2.568e+02  11.629  < 2e-16 ***
## DayOfWeekTue        -4.510e+02  1.446e+02  -3.119 0.001829 ** 
## DayOfWeekWed         1.452e+02  1.480e+02   0.981 0.326639    
## DayOfWeekThu         1.139e+02  1.419e+02   0.803 0.421980    
## DayOfWeekFri         3.682e+02  1.414e+02   2.603 0.009275 ** 
## DayOfWeekSat         1.329e+03  1.445e+02   9.197  < 2e-16 ***
## DayOfWeekSun        -7.597e+02  1.440e+02  -5.276 1.38e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2435 on 4285 degrees of freedom
## Multiple R-squared:  0.8902, Adjusted R-squared:  0.8893 
## F-statistic:  1021 on 34 and 4285 DF,  p-value: < 2.2e-16

Diagnostika modelu

par(mfrow = c(2, 2))
plot(model1)

par(mfrow = c(1, 1))
res1 <- residuals(model1)
jarque.bera.test(res1)
## 
##  Jarque Bera Test
## 
## data:  res1
## X-squared = 1187.6, df = 2, p-value < 2.2e-16
outlierTest(model1)
##        rstudent unadjusted p-value Bonferroni p
## 25459 -5.685671         1.3897e-08   6.0033e-05
## 25460 -5.421674         6.2293e-08   2.6910e-04
## 25603 -5.119478         3.1982e-07   1.3816e-03
## 25461 -5.079365         3.9480e-07   1.7056e-03
## 25604 -4.874049         1.1329e-06   4.8941e-03
## 25462 -4.709636         2.5601e-06   1.1060e-02

Upravený model s log-transformáciou

model2 <- lm(PowerConsumption_Zone1 ~ Temperature + Humidity + WindSpeed +
               I(log1p(GeneralDiffuseFlows)) + I(log1p(DiffuseFlows)) +
               factor(Hour) + DayOfWeek,
             data = udaje_model)

summary(model2)
## 
## Call:
## lm(formula = PowerConsumption_Zone1 ~ Temperature + Humidity + 
##     WindSpeed + I(log1p(GeneralDiffuseFlows)) + I(log1p(DiffuseFlows)) + 
##     factor(Hour) + DayOfWeek, data = udaje_model)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -13673.6  -1229.7     23.5   1471.0   7188.5 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    47442.32     679.80  69.789  < 2e-16 ***
## Temperature                     -175.44      27.57  -6.363 2.18e-10 ***
## Humidity                         -52.36       3.36 -15.585  < 2e-16 ***
## WindSpeed                       -282.95      19.20 -14.737  < 2e-16 ***
## I(log1p(GeneralDiffuseFlows))    488.64      96.11   5.084 3.85e-07 ***
## I(log1p(DiffuseFlows))          -771.21      72.87 -10.583  < 2e-16 ***
## factor(Hour)1                  -1318.24     256.72  -5.135 2.95e-07 ***
## factor(Hour)2                   -872.53     256.95  -3.396 0.000691 ***
## factor(Hour)3                  -6164.95     257.63 -23.929  < 2e-16 ***
## factor(Hour)4                 -11968.97     258.20 -46.354  < 2e-16 ***
## factor(Hour)5                 -16515.61     259.14 -63.732  < 2e-16 ***
## factor(Hour)6                 -16439.43     337.58 -48.699  < 2e-16 ***
## factor(Hour)7                 -14934.54     453.37 -32.942  < 2e-16 ***
## factor(Hour)8                 -12797.71     520.90 -24.568  < 2e-16 ***
## factor(Hour)9                 -10443.56     556.63 -18.762  < 2e-16 ***
## factor(Hour)10                 -8394.69     570.70 -14.710  < 2e-16 ***
## factor(Hour)11                 -6356.62     575.51 -11.045  < 2e-16 ***
## factor(Hour)12                 -5197.98     577.35  -9.003  < 2e-16 ***
## factor(Hour)13                 -4516.20     580.21  -7.784 8.77e-15 ***
## factor(Hour)14                 -4052.55     571.55  -7.090 1.56e-12 ***
## factor(Hour)15                 -4106.16     562.12  -7.305 3.29e-13 ***
## factor(Hour)16                 -4350.30     550.03  -7.909 3.27e-15 ***
## factor(Hour)17                 -3012.67     534.24  -5.639 1.82e-08 ***
## factor(Hour)18                  -293.25     516.47  -0.568 0.570197    
## factor(Hour)19                  4786.46     451.68  10.597  < 2e-16 ***
## factor(Hour)20                  5986.12     307.41  19.473  < 2e-16 ***
## factor(Hour)21                  5125.05     259.31  19.764  < 2e-16 ***
## factor(Hour)22                  4904.81     257.53  19.045  < 2e-16 ***
## factor(Hour)23                  2979.92     256.78  11.605  < 2e-16 ***
## DayOfWeekTue                    -430.79     144.72  -2.977 0.002930 ** 
## DayOfWeekWed                     196.60     148.22   1.326 0.184779    
## DayOfWeekThu                     125.31     141.51   0.886 0.375922    
## DayOfWeekFri                     396.80     141.02   2.814 0.004920 ** 
## DayOfWeekSat                    1331.95     144.37   9.226  < 2e-16 ***
## DayOfWeekSun                    -771.53     143.96  -5.359 8.79e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2435 on 4285 degrees of freedom
## Multiple R-squared:  0.8902, Adjusted R-squared:  0.8893 
## F-statistic:  1021 on 34 and 4285 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(model2)

par(mfrow = c(1, 1))
res2 <- residuals(model2)
jarque.bera.test(res2)
## 
##  Jarque Bera Test
## 
## data:  res2
## X-squared = 1211.7, df = 2, p-value < 2.2e-16
outlierTest(model2)
##        rstudent unadjusted p-value Bonferroni p
## 25459 -5.661457         1.5991e-08   0.00006908
## 25460 -5.392039         7.3413e-08   0.00031715
## 25603 -5.105855         3.4360e-07   0.00148430
## 25461 -5.047628         4.6589e-07   0.00201260
## 25604 -4.861572         1.2063e-06   0.00521110
## 25462 -4.679678         2.9619e-06   0.01279600

Robustné štandardné chyby

coeftest(model2, vcov. = vcovHC(model2, type = "HC1"))
## 
## t test of coefficients:
## 
##                                  Estimate  Std. Error  t value  Pr(>|t|)    
## (Intercept)                    47442.3217    598.7108  79.2408 < 2.2e-16 ***
## Temperature                     -175.4384     24.1693  -7.2587 4.617e-13 ***
## Humidity                         -52.3578      2.9974 -17.4679 < 2.2e-16 ***
## WindSpeed                       -282.9507     22.4675 -12.5938 < 2.2e-16 ***
## I(log1p(GeneralDiffuseFlows))    488.6411     82.5838   5.9169 3.536e-09 ***
## I(log1p(DiffuseFlows))          -771.2077     66.4586 -11.6043 < 2.2e-16 ***
## factor(Hour)1                  -1318.2374    326.1968  -4.0412 5.410e-05 ***
## factor(Hour)2                   -872.5329    363.4400  -2.4008  0.016403 *  
## factor(Hour)3                  -6164.9508    330.7945 -18.6368 < 2.2e-16 ***
## factor(Hour)4                 -11968.9716    273.4080 -43.7770 < 2.2e-16 ***
## factor(Hour)5                 -16515.6105    239.4738 -68.9663 < 2.2e-16 ***
## factor(Hour)6                 -16439.4311    285.7244 -57.5360 < 2.2e-16 ***
## factor(Hour)7                 -14934.5411    372.4581 -40.0972 < 2.2e-16 ***
## factor(Hour)8                 -12797.7136    429.3244 -29.8090 < 2.2e-16 ***
## factor(Hour)9                 -10443.5591    453.2561 -23.0412 < 2.2e-16 ***
## factor(Hour)10                 -8394.6861    464.2972 -18.0804 < 2.2e-16 ***
## factor(Hour)11                 -6356.6240    467.1706 -13.6066 < 2.2e-16 ***
## factor(Hour)12                 -5197.9775    470.7731 -11.0414 < 2.2e-16 ***
## factor(Hour)13                 -4516.2052    484.7641  -9.3163 < 2.2e-16 ***
## factor(Hour)14                 -4052.5488    479.3877  -8.4536 < 2.2e-16 ***
## factor(Hour)15                 -4106.1628    470.6623  -8.7242 < 2.2e-16 ***
## factor(Hour)16                 -4350.3018    458.6032  -9.4860 < 2.2e-16 ***
## factor(Hour)17                 -3012.6718    448.9005  -6.7112 2.182e-11 ***
## factor(Hour)18                  -293.2530    457.3322  -0.6412  0.521411    
## factor(Hour)19                  4786.4585    411.4654  11.6327 < 2.2e-16 ***
## factor(Hour)20                  5986.1194    284.1544  21.0664 < 2.2e-16 ***
## factor(Hour)21                  5125.0494    263.6862  19.4362 < 2.2e-16 ***
## factor(Hour)22                  4904.8127    262.5919  18.6785 < 2.2e-16 ***
## factor(Hour)23                  2979.9242    285.9998  10.4193 < 2.2e-16 ***
## DayOfWeekTue                    -430.7847    170.4356  -2.5276  0.011522 *  
## DayOfWeekWed                     196.6000    158.8200   1.2379  0.215829    
## DayOfWeekThu                     125.3096    147.1084   0.8518  0.394363    
## DayOfWeekFri                     396.7987    150.0967   2.6436  0.008232 ** 
## DayOfWeekSat                    1331.9508    141.7264   9.3980 < 2.2e-16 ***
## DayOfWeekSun                    -771.5285    166.1685  -4.6430 3.536e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Záver

Spotreba elektriny je výrazne ovplyvnená časovými efektmi, pričom meteorologické premenné poskytujú dodatočnú vysvetľujúcu informáciu. Pri veľkom počte pozorovaní je potrebné interpretovať testy normality rezíduí spolu s diagnostickými grafmi.