Instructions: Part B consists of a simple dataset of residential power usage for January 1998 until December 2013. Your assignment is to model these data and a monthly forecast for 2014. The data is given in a single file. The variable ‘KWH’ is power consumption in Kilowatt hours, the rest is straight forward. Add these to your existing files above - clearly labeled.
Explore data.
library (readr)
power_data <- read_csv("https://raw.githubusercontent.com/wheremagichappens/an.dy/master/data624/ResidentialCustomerForecastLoad-624.csv")
Transformed data into time-series with freq - 12.
ts_data <- ts(power_data$KWH, frequency = 12, start = c(1998,1))
Box-Ljung test shows the series is not white noise (non-stationary with a weak positive trend and strong seasonality). 2008-Sep is missing and it was handled by mean imputation of all Septembers. On Jul 2010, we see that KWH suddenly drops dramatically (indeed outlier) - it could be due to input error but we are not so sure so we will keep it. During summer and winter time, we see the usage is usually higher. Seasonplot and ggAcf show that seasonality is pretty much consistent every year.
# Missing data detected
ts_data
FALSE Jan Feb Mar Apr May Jun Jul
FALSE 1998 6862583 5838198 5420658 5010364 4665377 6467147 8914755
FALSE 1999 7183759 5759262 4847656 5306592 4426794 5500901 7444416
FALSE 2000 7068296 5876083 4807961 4873080 5050891 7092865 6862662
FALSE 2001 7538529 6602448 5779180 4835210 4787904 6283324 7855129
FALSE 2002 7099063 6413429 5839514 5371604 5439166 5850383 7039702
FALSE 2003 7256079 6190517 6120626 4885643 5296096 6051571 6900676
FALSE 2004 7584596 6560742 6526586 4831688 4878262 6421614 7307931
FALSE 2005 8225477 6564338 5581725 5563071 4453983 5900212 8337998
FALSE 2006 7793358 5914945 5819734 5255988 4740588 7052275 7945564
FALSE 2007 8031295 7928337 6443170 4841979 4862847 5022647 6426220
FALSE 2008 7964293 7597060 6085644 5352359 4608528 6548439 7643987
FALSE 2009 8072330 6976800 5691452 5531616 5264439 5804433 7713260
FALSE 2010 9397357 8390677 7347915 5776131 4919289 6696292 770523
FALSE 2011 8394747 8898062 6356903 5685227 5506308 8037779 10093343
FALSE 2012 8991267 7952204 6356961 5569828 5783598 7926956 8886851
FALSE 2013 10655730 7681798 6517514 6105359 5940475 7920627 8415321
FALSE Aug Sep Oct Nov Dec
FALSE 1998 8607428 6989888 6345620 4640410 4693479
FALSE 1999 7564391 7899368 5358314 4436269 4419229
FALSE 2000 7517830 8912169 5844352 5041769 6220334
FALSE 2001 8450717 7112069 5242535 4461979 5240995
FALSE 2002 8058748 8245227 5865014 4908979 5779958
FALSE 2003 8476499 7791791 5344613 4913707 5756193
FALSE 2004 7309774 6690366 5444948 4824940 5791208
FALSE 2005 7786659 7057213 6694523 4313019 6181548
FALSE 2006 8241110 7296355 5104799 4458429 6226214
FALSE 2007 7447146 7666970 5785964 4907057 6047292
FALSE 2008 8037137 NA 5101803 4555602 6442746
FALSE 2009 8350517 7583146 5566075 5339890 7089880
FALSE 2010 7922701 7819472 5875917 4800733 6152583
FALSE 2011 10308076 8943599 5603920 6154138 8273142
FALSE 2012 9612423 7559148 5576852 5731899 6609694
FALSE 2013 9080226 7968220 5759367 5769083 9606304
# Mean imputation - with September
sept <- subset(power_data, grepl("Sep", power_data$`YYYY-MMM`))[3]
sept_mean <- mean(sept$KWH, na.rm=TRUE)
# Apply mean to missing row
power_data$KWH[is.na(power_data$KWH) == TRUE] <- sept_mean
# Re-created ts
ts_data <- ts(power_data$KWH, frequency = 12, start = c(1998,1))
# general series plot
autoplot(ts_data)
# seasonal plot
ggseasonplot(ts_data)
# sub-seasonal plot
ggsubseriesplot(ts_data)
# STL decomposition
stl(ts_data, s.window = 'periodic') %>% autoplot()
# Autocorrelation
ggAcf(ts_data)
Box.test(ts_data, type = c("Ljung-Box"))
FALSE
FALSE Box-Ljung test
FALSE
FALSE data: ts_data
FALSE X-squared = 34.118, df = 1, p-value = 5.187e-09
# summary statistics
summary(ts_data)
FALSE Min. 1st Qu. Median Mean 3rd Qu. Max.
FALSE 770523 5434539 6314472 6508724 7649733 10655730
summary(power_data)
FALSE CaseSequence YYYY-MMM KWH
FALSE Min. :733.0 Length:192 Min. : 770523
FALSE 1st Qu.:780.8 Class :character 1st Qu.: 5434539
FALSE Median :828.5 Mode :character Median : 6314472
FALSE Mean :828.5 Mean : 6508724
FALSE 3rd Qu.:876.2 3rd Qu.: 7649733
FALSE Max. :924.0 Max. :10655730
# Boxplot
boxplot(ts_data~cycle(ts_data))
From residual test (Box-Ljung), we found that ets - MNM is not reliable predictor as residuals are not white noise. Other models are all valid as residuals are all white noise (p > 0.05 from checkresiduals()) and fairly normally distributed. We will compare Arima, ets - AAN and ets - AAdN from stl decomposition in terms of RMSE on test set in the next section. The models predict h = 12 (2014 Jan to Dec) based on data running from 1998 Jan to 2013 Dec. Based on RMSE calculations on training sets, ARIMA is the best model - the lowest RMSE with 823918.8, where as STL (no-demped) - ANN is the worst predictor. In the next section, we will see same holds true for RMSE on test set.
# auto.arima
arima_model <- auto.arima(ts_data)
# forecast values
arima_model <- forecast(arima_model, h=12)
# forecast plot
autoplot(arima_model) + autolayer(fitted(arima_model))
accuracy(arima_model)
FALSE ME RMSE MAE MPE MAPE MASE
FALSE Training set -25755.56 823918.8 489803.5 -5.518168 11.63252 0.7141674
FALSE ACF1
FALSE Training set 0.0130951
checkresiduals(arima_model)
FALSE
FALSE Ljung-Box test
FALSE
FALSE data: Residuals from ARIMA(0,0,1)(1,1,1)[12] with drift
FALSE Q* = 12.619, df = 20, p-value = 0.8931
FALSE
FALSE Model df: 4. Total lags used: 24
#stlf - etsmodel estimation --- A,N,N is chosen.
stl_ndemp <- stlf(ts_data, damped=FALSE, s.window = "periodic", robust=TRUE, h = 12)
# forecast plot
autoplot(stl_ndemp) + autolayer(fitted(stl_ndemp))
accuracy(stl_ndemp)
FALSE ME RMSE MAE MPE MAPE MASE
FALSE Training set 71088.13 837414.3 504911.4 -4.208005 11.91808 0.7361958
FALSE ACF1
FALSE Training set 0.2059222
checkresiduals(stl_ndemp)
FALSE
FALSE Ljung-Box test
FALSE
FALSE data: Residuals from STL + ETS(A,N,N)
FALSE Q* = 25.094, df = 22, p-value = 0.2926
FALSE
FALSE Model df: 2. Total lags used: 24
#stlf - etsmodel estimation --- M, Ad, N is chosen.
stl_demp <- stlf(ts_data, damped=TRUE, s.window = "periodic", robust=TRUE, h = 12)
# forecast plot
autoplot(stl_demp) + autolayer(fitted(stl_demp))
accuracy(stl_demp)
FALSE ME RMSE MAE MPE MAPE MASE
FALSE Training set 56414.59 837021.2 504406.6 -4.462232 11.95523 0.7354597
FALSE ACF1
FALSE Training set 0.205215
checkresiduals(stl_demp)
FALSE
FALSE Ljung-Box test
FALSE
FALSE data: Residuals from STL + ETS(A,Ad,N)
FALSE Q* = 23.407, df = 19, p-value = 0.2199
FALSE
FALSE Model df: 5. Total lags used: 24
# ETS models - MNM
ets_model <- ets(ts_data)
# forecast plot
autoplot(forecast(ets_model, h=12)) + autolayer(fitted(ets_model))
accuracy(ets_model)
FALSE ME RMSE MAE MPE MAPE MASE
FALSE Training set 62242.3 826660.2 500306.9 -4.334494 11.96305 0.7294821
FALSE ACF1
FALSE Training set 0.1734508
checkresiduals(ets_model)
FALSE
FALSE Ljung-Box test
FALSE
FALSE data: Residuals from ETS(M,N,M)
FALSE Q* = 25.272, df = 10, p-value = 0.004853
FALSE
FALSE Model df: 14. Total lags used: 24
Using Time series cross-validation, we compute RMSE on testset (h=12). We will pick the model with the lowest RMSE on testset as our final model. ARIMA is the worst predictor in terms of RMSE on test set (the highest RMSE) which shows that the model is seriously overfitted - low bias but very high variance. Surprisingly, STL (no-demped) - ANN, which was the worst predictor in terms of RMSE on training set, has the lowest RMSE on test set among all models. Since this is yearly forecast, tsCV(h = 12) would make sense.
Given that 4 models we created did not vary much in terms of RMSE on training, while STL - ANN has significantly lower RMSE on test set than ARIMA, we will choose STL - ANN as our final model.
arima_cv <- function(x, h){forecast(Arima(ts_data, order = c(0, 0, 1), seasonal = c(1, 1, 1), include.drift = TRUE), h=h)}
e <- tsCV(ts_data, arima_cv, h=12)
sqrt(mean(e^2, na.rm=TRUE))
FALSE [1] 2134944
e <- tsCV(ts_data, stlf, damped=FALSE, s.window = "periodic", robust=TRUE, h=12)
sqrt(mean(e^2, na.rm=TRUE))
FALSE [1] 1006075
e <- tsCV(ts_data, stlf, damped=TRUE, s.window = "periodic", robust=TRUE, h=12)
sqrt(mean(e^2, na.rm=TRUE))
FALSE [1] 1012071
From above, we found that ARIMA is the worst predictor and STL - AAN is the best model as RMSE on test set is the lowest, contradicting to its’ RMSE on train set. It comes down to the discussion of bias-variance trade off; overfitted model cannot generalize the outcome of predictions on unseen data well. We will, thus, select Model #2 as our final choice - fairly low bias with the lowest variance.